Using BioNano Maps to Improve an Insect Genome Assembly

Using BioNano Maps to Improve
an Insect Genome Assembly
!
!
!
!
!
!
!
!
!
Sue Brown
Oct 23, 2014
1

Tribolium castaneum, the red flour beetle
Genetic model organism for developmental, physiology and toxicology studies
!
• Easily cultured
• Short generation time
• Small genome size
• Molecular and visible marker genetic maps
• Genetic tools: balancers, deficiencies
• Genomic libraries: lambda and BAC
• cDNA libraries
• Mutant analysis and RNAi
• Transformation
• 7x Sanger draft genome (Nature, 2008)
2

Tribolium Genome
!
!
!
!
Genome size: 200 (Mb) Cot Analysis
9 autosomes, X and Y
Low methylation
Long period interspersion
!
!
!
!
!
!
Jeff Stuart, Purdue University
3

Molecular map markers used to anchor scaffolds to Chromosome builds
Few X markers, no Y, variable marker density
4

Tcas 3.0 Reference Genome stats from NCBI
Input file N50 (Mb) Number Cumulative
Length (Mb)
Genome contigs 0.04 8814 160.74
Genome scaffolds 0.98 481 152.53
Unmapped scaffolds 352
Unmapped contigs 1884
5

Algorithms and filters used to improve the Tribolium
draft Assembly with Physical Maps Based on
Imaging Ultra-Long Single DNA Molecules
!
Jennifer Shelton
2014
6

Data formats
BNX molecule 1
BNX - text file of
molecules
7

Data formats
BNX molecule 1
BNX - text file of
molecules
CMAP - text file of
consensus maps
7

Data formats
in silico CMAP -
from genome
FASTA
BNX molecule 1
in silico CMAP 1
BNX - text file of
molecules
CMAP - text file of
consensus maps
7

Data formats
in silico CMAP -
from genome
FASTA
BNG CMAP -
from assembled
molecules
BNX molecule 1
in silico CMAP 1
BNG CMAP 1
BNX - text file of
molecules
CMAP - text file of
consensus maps
7

Data formats
in silico CMAP -
from genome
FASTA
BNG CMAP -
from assembled
molecules
XMAP - text file of
alignment of two
CMAPs
BNX molecule 1
in silico CMAP 1
BNG CMAP 1
in silico CMAP 1 in silico CMAP 2
BNG CMAP 1 BNG CMAP 2
BNX - text file of
molecules
CMAP - text file of
consensus maps
7

Assembly Pipeline
BBNNXX BNX
scanBNX
4) QC graphs for
each flowcell
adjBNX
adjBNX
5) Merge all
flowcells
6) First
assemblies
Strict
p-value
threshold
Default
p-value
threshold
Assembly workflow:! !
1) The Irys produces tiff files that are converted into BNX text files.!
2) Each chip produces one BNX file for each of two flowcells.!
3) BNX files are split by scan and aligned to the sequence reference. Stretch (bases per pixel) is
3) use sequence reference to adjust molecule stretch for each scan
recalculated from the alignment.!
4) Quality check graphs are created for each pre-adjusted flowcell BNX.!
5) Adjusted flowcell BNXs are merged.!
6) The first assemblies are run with a variety of p-value thresholds.!
7) The best of the first assemblies (red oval) is chosen and a version of this assembly is produced
with a variety of minimum molecule length filters.
adjBNX
adjBNX
1) The Irys
produces tiff files
3) Scan BNX are
adjusted
7) Second
assemblies
Strict
minimum
molecule
length
Relaxed
minimum
molecule
length
mergeBNX
Relaxed
p-value
threshold
BBNNXX BNX
scanBNX
BBNNXX BNX
scanBNX
BBNNXX BNX
scanBNX
2) Each chip
produces flowcell
BNX files
BNX
BNX
BNX
BNX
8

Assembly Pipeline
In recent datasets when SNR is low and alignment is good we see a spike in
bases per pixel (bpp) in the first scan, a plateau and a lower plateau
First scan in a
flow cell
9

Assembly Pipeline
BBNNXX BNX
scanBNX
4) QC graphs for
each flowcell
adjBNX
adjBNX
5) Merge all
flowcells
6) First
assemblies
Strict
p-value
threshold
Default
p-value
threshold
5) Use sequence reference to determine assembly noise parameters.
Estimated recalculated from genome the alignment.!
size is used to set the p-value threshold.
adjBNX
adjBNX
1) The Irys
produces tiff files
3) Scan BNX are
adjusted
7) Second
assemblies
Strict
minimum
molecule
length
Relaxed
minimum
molecule
length
mergeBNX
Relaxed
p-value
threshold
BBNNXX BNX
scanBNX
BBNNXX BNX
scanBNX
BBNNXX BNX
scanBNX
2) Each chip
produces flowcell
BNX files
BNX
BNX
BNX
BNX
10

Assembly Pipeline
BBNNXX BNX
scanBNX
4) QC graphs for
each flowcell
adjBNX
adjBNX
5) Merge all
flowcells
6) First
assemblies
Strict
p-value
threshold
Default
p-value
threshold
6/7) Variants of the starting p-value and default minimum molecule length are
explored in nine assemblies.
recalculated from the alignment.!
adjBNX
adjBNX
1) The Irys
produces tiff files
3) Scan BNX are
adjusted
7) Second
assemblies
Strict
minimum
molecule
length
Relaxed
minimum
molecule
length
mergeBNX
Relaxed
p-value
threshold
BBNNXX BNX
scanBNX
BBNNXX BNX
scanBNX
BBNNXX BNX
scanBNX
2) Each chip
produces flowcell
BNX files
BNX
BNX
BNX
BNX
11

Current Tribolium sequence-based assembly
Input file N50 (Mb) Number
of
Scaffolds
Cumulative
Length (Mb)
Genome FASTA 1.16 2240 160.74
in silico CMAP from FASTA 1.20 223 152.53
223 scaffolds from the sequence-based assembly were longer than 20 (kb)
with more than 5 labels and were converted into in silico CMAPs
12

Assembly Results
Input file N50 (Mb) Number Cumulative
Length (Mb)
Genome FASTA 1.16 2240 160.74
in silico CMAP from FASTA 1.20 223 152.53
CMAP from assembled BNG
molecules (BNG CMAP)
1.35 216 200.47
BNG assembled molecules had a higher N50 and longer cumulative length
than the sequence assembly
!
The estimated size of the Tribolium genome is ~200 (Mb)
13

Simplest XMAP alignment description
1 (Mb)
1.1 (Mb)
1.1 (Mb) 1.3 (Mb)
Breadth of alignment coverage for in silico CMAP: 2.1 (Mb)
Total alignment length for in silico CMAP: 2.1 (Mb)
!
Breadth of alignment coverage for BNG CMAP: 2.4 (Mb)
Total alignment length for BNG CMAP: 2.4 (Mb)
in silico CMAP
from genome
FASTA
CMAP from
assembled
molecules
in silico CMAP 1 in silico CMAP 2
14

Complex XMAP alignment description
1 (Mb)
in silico CMAP 1
1.1 (Mb) 1.3 (Mb)
Breadth of alignment coverage for in silico CMAP: 1 (Mb)
Total alignment length for in silico CMAP: 2 (Mb)
!
Breadth of alignment coverage for BNG CMAP: 2.4 (Mb)
Total alignment length for BNG CMAP: 2.4 (Mb)
in silico CMAP
from genome
FASTA
CMAP from
assembled
molecules
15

Alignment of CMAPs
1 (Mb)
in silico CMAP 1
1.1 (Mb) 1.3 (Mb)
Breadth of alignment coverage compared to total aligned length can indicate
relevant relationships between assemblies
!
In this example differences between "breadth" and "total" length could be due to:
!
Genomic duplications in sample molecules were extracted from
Assembly of alternate haplotypes
Mis-assembly creating redundant contigs
Collapsed repeat in sequence assembly
in silico CMAP
from genome
FASTA
CMAP from
assembled
molecules
16

Alignment of BNG assembly to reference genome
CMAP name Breadth of alignment
coverage for CMAP
(Mb)
Length of total
alignment for
CMAP (Mb)
Percent of CMAP
aligned
in silico CMAP from FASTA 124.04 132.40 81
CMAP from assembled BNG
molecules (BNG CMAP)
131.64 132.34 67
Close to 4% of the alignment of the in silico CMAP appears to be redundant
!
Overall 81% of the in silico CMAP aligns to the BNG consensus map
17

ChLG 9 super!
Alignment of BNG assembly to reference genome
scaffold
BNG consensus
maps
ChLG 9!
scaffolds
130 131 133 134 132 129 135 127 136 137 BNG consensus
Typically where redundant alignments occur two BNG consensus maps
aligned suggesting they represent haplotypes although this has not been
verified
maps
18

Potential haplotypes where overlapping BNG cmaps align
ChLG 9 super!
scaffold
BNG consensus
maps
ChLG 9!
scaffolds
128 130 131 133 134 132 BNG consensus
maps
19

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium
scaffolds
+ in silico CMAP 1 + in silico CMAP 4
Stitch.pl estimates super scaffolds using alignments of scaffolds and
assembled BNG molecules using BNG Refaligner
in silico CMAP
aligned as
reference
+ in silico CMAP 2 - in silico CMAP 3
20

scaffolds
in silico CMAP
aligned as
reference
alignment is
inverted and
used as input for
stitch
20

scaffolds
in silico CMAP
aligned as
reference
alignment is
inverted and
used as input for
stitch
+ in silico CMAP 4
alignments are
filtered based on
alignment length
relative total
possible
alignment length
and confidence
+ in silico CMAP 1
20

scaffolds
BNG CMAP 1
+ in silico CMAP 1
Stitch.pl checks alignment length against potential alignment lengths to find
relevant global rather than local alignments
alignment
passes because
the alignment
length is greater
than 30% of the
potential
alignment length
21

BNG CMAP 1
scaffolds
+ in silico CMAP 2
alignment
passes because
the alignment
length is greater
than 30% of the
potential
alignment length
22

scaffolds
- in silico CMAP 2
BNG CMAP 2
alignment
passes because
the alignment
length is greater
than 30% of the
potential
alignment length
23

scaffolds
- in silico CMAP 2
BNG CMAP 2
alignment fails
because the
alignment length
is less than 30%
of the potential
alignment length
24

scaffolds
+ in silico CMAP 2
BNG CMAP 2
alignment fails
because the
alignment length
is less than 30%
of the potential
alignment length
25

scaffolds
BNG CMAP 2
alignment
passes because
the alignment
length is greater
than 30% of the
potential
alignment length
- in silico CMAP 3
26

BNG CMAP 2
scaffolds
alignment fails
because the
alignment length
is less than 30%
of the potential
alignment length
- in silico CMAP 3
27

BNG CMAP 2
scaffolds
alignment
passes because
the alignment
length is greater
than 30% of the
potential
alignment length
+ in silico CMAP 4
28

scaffolds
+ in silico CMAP 4
high quality
scaffolding
alignments...
+ in silico CMAP 1
29

scaffolds
are filtered for
longest and
highest
confidence
alignment for
each in silico
CMAP
+ in silico CMAP 4
high quality
scaffolding
alignments...
+ in silico CMAP 1
29

scaffolds
are filtered for
longest and
highest
confidence
alignment for
each in silico
CMAP
Passing
alignments are
used to super
scaffold
+ in silico CMAP 4
high quality
scaffolding
alignments...
+ in silico CMAP 1
29

scaffolds
Stitch is iterated
and additional
super
scaffolding
alignments are
found

scaffolds
Stitch is iterated
and additional
super
scaffolding
alignments are
found
Until all super
scaffolds are
joined + in silico CMAP 2 - in silico CMAP 3

scaffolds
- in silico CMAP 3
+ in silico CMAP 2
+ in silico CMAP 4
+ in silico CMAP 1
If gap length is estimated to be negative gaps are represented by 100 (bp)
spacers
31

Gap lengths
Distribution of gap lengths for automated output
Gap length (bp)
Of the automated stitch.pl Tribolium super-scaffolds there were 66 gaps had
known lengths and 26 had negative lengths (set to 100 (bp))
!
Of the manually edited Tribolium super-scaffolds there were 66 gaps had
Count
−1500000 −1000000 −500000 0 500000 1000000
0 5 10 15 20
Negative gap lengths
Positive gap lengths
32

Is part of scaffold_23 connected to 136?!
I went with the second alignment (21-26 together and 136-137 together because it is supported by genetic maps) but we should
check these assemblies. ! ! In bottom alignment of 136 you can see that a large section of the BNG map 32 (which joins 23 to 136) is a duplicate in the BNG
assembly?
22 23 129 136 137
The longest negative gap length is from a BNG consenus map joining in silico
23 and 136
33

Is part of scaffold_23 connected to 136?!
I went with the second alignment (21-26 together and 136-137 together because it is supported by genetic maps) but we should
check these assemblies. ! ! In bottom alignment of 136 you can see that a large section of the BNG map 32 (which joins 23 to 136) is a duplicate in the BNG
assembly?
22 23 129 136 137
!
Because the same region of 136 aligns to another BNG consensus map that
aligns to its chromosome linkage group this alignment was rejected and stitch
was re-run
34

scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?
ChLG 2 super!
scaffold
ChLG 2 super!
scaffold
BNG consensus
maps
BNG consensus
maps
ChLG 2!
scaffolds
133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145
Two new super scaffolds were created and the sequence similarity is being
evaluated
min confidence 10
scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?
ChLG 2!
scaffolds
130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145
U 18 14 16 19 20 21 22 23 24 25 26 27 28 30
BNG consensus
maps
U 18 14 16 19 20 21 22 23 24 25 26 27 28 30
BNG consensus
maps
35

Gap lengths
Gap length (bp)
This negative alignment also indicated a potential assembly issue
Count
−1500000 −1000000 −500000 0 500000 1000000
0 5 10 15 20
36

This negative gap length is from a BNG consenus map joining in silico 81 and
102 and 103
Half of scaffold_81 aligns with ChLG7
37

Half of scaffold_81 aligns with ChLG7
79 80 81 82 83
Because the other half of 81 aligns to another BNG consensus map that aligns
to its chromosome linkage group this alignment was rejected and stitch was re-run
!
The BNG maps suggest a mis-assembly of in silico 81 at a sequence level
38

Gap length (bp)
Count
−1500000 −1000000 −500000 0 500000 1000000
0 5 10 15 20
Gap lengths
All extremely small negative gap lengths, < -20,000 (bp) (shaded), were
independently flagged as potential sequence mis-assemblies to be checked at
the sequence-level
39

Gap length (bp)
Count
−1500000 −1000000 −500000 0 500000 1000000
0 5 10 15 20
Gap lengths
All gaps from the shaded regions were also manually rejected and stitch.pl
was rerun without them for the current super-scaffolded assembly
!
We suspect extremely small negative gap sizes may be useful in locating
sequence mis-assemblies
!
stitch.pl version 1.4.5 rejects alignments if negative gap lengths < -20,000 (bp)
but lists them in data summary
40

Tribolium super-scaffolds
Input file N50 (Mb) Number of
Scaffolds
Cumulative
Length (Mb)
genome FASTA 1.16 2240 160.74
super-scaffold
FASTA
4.46 2150 165.92
N50 of the super-scaffolded genome was ~4 times greater than the original
!
Super-scaffolds tend to agree with the Tribolium genetic map
41

Input file N50 (Mb) Number of
Scaffolds
genome FASTA 1.16 2240 160.74
4.46 2150 165.92
For Tribolium :
first minimum percent aligned = 30%
first minimum confidence = 13
Cumulative
Length (Mb)
second minimum percent aligned = 90%
second minimum confidence = 8
!
super-scaffold
FASTA
Lower quality alignments were manually selected if genetic map also supported
the order
Complex scaffolds were broken manually for sequence level evaluation
42

min confidence 10
From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold.
ChLG X was reduced from 13 scaffolds to 2 with one scaffold being moved to
ChLG 3
ChLG X super!
scaffold
BNG consensus
maps
ChLG X!
scaffolds
BNG consensus
maps
U 3 4 5 6 7 U 8 9 10 11 12 13
43

min confidence 10
51 U 43 45 44 46
The second scaffold from ChLG X aligned to scaffolds from a portion of
ChLG 3
ChLG 3 super!
scaffold
BNG consensus
maps
ChLG 3!
scaffolds
BNG consensus
maps
32 33 34 35 36 2 37 38 39 40 41 42
ChLG 3 super!
scaffold
BNG consensus
maps
ChLG 3 super!
scaffold
BNG consensus
44

min confidence 10
Two unplaced scaffolds aligned to ChLG X
ChLG X super!
scaffold
BNG consensus
maps
ChLG X!
scaffolds
BNG consensus
maps
U 3 4 5 6 7 U 8 9 10 11 12 13
45

min confidence 10
4% Redundancy in alignment may be from assembly of haplotypes (generally
observed as two BNG consensus maps aligning to the same in silico map)
ChLG X super!
scaffold
BNG consensus
maps
ChLG X!
scaffolds
BNG consensus
maps
U 3 4 5 6 7 U 8 9 10 11 12 13
46

Potential haplotypes where overlapping BNG cmaps align
min confidence 10
ChLG X super!
scaffold
BNG consensus
maps
ChLG X!
scaffolds
BNG consensus
maps
U 3 4 5 6 7 U 8 9 10 11 12 13
47

Tribmoinl icuonmfid esnucep 1e0r-scaffolds
ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?
128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145
For ChLG 9 21 scaffolds were reduced to 9
ChLG 9 super!
scaffold
BNG consensus
maps
ChLG 9!
scaffolds
BNG consensus
maps
48

min confidence 10
For ChLG 5 17 scaffolds were reduced to 4
ChLG 5 super!
scaffold
BNG consensus
maps
ChLG 5!
scaffolds
BNG consensus
maps
69 68 70 71 72 73 74 U 75 76 77 78 79 80 81 82 83
49

Future directions: Structural Variant (SV)
Use SV-detect pipelines to resize existing gaps in scaffolds and identify
mis-assemblies
50

Acknowledgements
K-INBRE Bioinformatics Core!
Susan Brown - PI
Nic Herndon - script development
Nanyan Lu - manual evaluation
Michelle Coleman - extractions and running the Irys!
Zachary Sliefert - metric summaries
!
Bionano Genomics!
Ernest Lam - assembly pipeline best practices assistance
Weiping Wang - assistance with data formats
Palak Sheth - collaboration to standardize analysis
!
Script availability!
https://github.com/i5K-KINBRE-script-share/Irys-scaffolding
BNG scripts available by request from BNG
!
Slide availability!
http://www.slideshare.net/kstatebioinformatics/using-bionano-maps-to-improve-an-insect-genome-
Physical
Molecules!
University, Warren
Kansas State University
assembly
!
This project was supported by grants from the National Center for Research Resources
(5P20RR016475) and the National Institute of General Medical Sciences (8P20GM103418) from
the National Institutes of Health.
were constructed by mapping molecular markers from the
the assembly scaffolds, anchoring greater than 90%
51

Gap lengths
Gap length (bp)
Of the automated stitch.pl Tribolium super-scaffolds there were 66 gaps had
!
Of the manually edited Tribolium super-scaffolds there were 66 gaps had
Count
−1500000 −1000000 −500000 0 500000 1000000
0 5 10 15 20

Using BioNano Maps to Improve an Insect Genome Assembly

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Using BioNano Maps to Improve an Insect Genome Assembly

Ähnlich wie Using BioNano Maps to Improve an Insect Genome Assembly (20)

Mehr von Jennifer Shelton

Mehr von Jennifer Shelton (14)