Automated assemblies are one thing, good assemblies are another!
This presentation covers the basic concepts of using paired-end and mate pair read data to identify mis-assemblies. It also covers some of the tools for visualising and correcting mis-assemblies. It attempts to rate these tools on their feature set and their scalability beyond small (<15 Mbase) genomes, and closes with remarks on the features an ideal genome assembly editing tool should have.
1. Genome Assembly Forensics and Visualisation
Nathan S. Watson-Haigh
Fri 11th May 2012, ACPFG Journal Club
Schatz, M.C. et al., 2007. Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biology, 8(3), p.R34.
Phillippy, A.M., Schatz, M.C. & Pop, M., 2008. Genome assembly forensics: finding the elusive mis-assembly. Genome Biology, 9(3), p.R55.
Schatz, M.C. et al., 2011. Hawkeye and AMOS: Visualizing and Assessing the Quality of Genome Assemblies. Briefings in Bioinformatics. Available at: http://bib.oxfordjournals.org/content/early/2011/12/23/bib.bbr074.
11. Assembly Metrics – N50
• The N50 is the most widely reported metric for de novo assemblies
• It is a single measure of the contig length distribution of an assembly
– If contigs are sorted into descending length order, the N50 is the length of the contig at which the running total first reaches at least 50% of the total length of all the contigs
– Commonly reported with the N90 and N95
• These stats DO NOT imply anything about assembly quality
– Could simply concatenate contigs together to get a better N50!!
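The N50 definition above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for an assembler's own statistics; the function name `n50` and the contig lengths are made up for the example. It also shows why the metric says nothing about quality: joining two contigs with no supporting evidence raises it.

```python
def n50(contig_lengths, fraction=0.5):
    """Length of the contig at which the running total of lengths,
    taken in descending order, first reaches `fraction` of the assembly."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    lengths = sorted(contig_lengths, reverse=True)
    threshold = fraction * sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running >= threshold:
            return length


# Hypothetical contig lengths (bp), for illustration only.
contigs = [100, 80, 60, 40, 20]
print(n50(contigs))        # N50 of the original contigs: 80
print(n50(contigs, 0.9))   # N90 of the original contigs: 40

# Blindly concatenating the two largest contigs "improves" the N50
# without making the assembly any more correct.
print(n50([100 + 80, 60, 40, 20]))  # 180
```

The same function gives the N90 or N95 by passing `fraction=0.9` or `fraction=0.95`.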
27. Automated Assemblies Are One Thing, Good Assemblies Are Another
• Given the computer resources, you can generate an automated assembly in a few weeks
– Not necessarily good
– Need to optimise assembly parameters
• For small genomes (< ~15 Mbases)
– Commodity hardware
– OLC (overlap-layout-consensus) assemblers
• For larger genomes
– More RAM (10s to 100s of Gbytes) for OLC assemblers
– De Bruijn graph assemblers
– Read mapping step to generate contig read alignments
28. Automated Assemblies Are One Thing, Good Assemblies Are Another
• Automated assemblies need to be checked for mis-assemblies
– Need paired-end/mate pair reads
– Need viewers to visualise paired-end data
– Need editors to break/join/reassemble parts of the assembly deemed to be inconsistent with read pair info
– Need enough computer hardware to allow all this data to be loaded, especially with large volumes of Illumina paired-end data
29. Automated Assemblies Are One Thing, Good Assemblies Are Another
• Very time consuming and laborious to check/edit
– Small assemblies (< ~15 Mbases)
• Several weeks/few months to move 1 scaffold/contig at a time
– Large assemblies need a team to do the same thing
• Need enough RAM to load all the paired-end data
• Need ways to identify regions requiring closer inspection
• Identify possible mis-assemblies
• Major hurdles
– Software inadequacies
– Time
– File formats! Grrrr!
30. Software Inadequacies
Software   | Contig View | Scaffold View | Editing | Reassemble | Clipping | Other Info
SeqMan Pro | 9           | 9             | 6       | 6          | 6        | $$, buggy, not for large assemblies (32bit), 1 template size
Gap5       | 6           | NA            | 9       | NA         | 8        | Free, join editor, contig comparator, poor visual support for many contigs, shuffle pads, ACE, multiple template sizes
Consed     | 6           | 6             | 5       | 9          | 6        | Free/US$2500/US$10k, poor visual support for many contigs, multiple template sizes
Hawkeye    | 9           | 9             | NA      | NA         | 7        | Leverages AMOS, automated detection of mis-assemblies, large assemblies, modular
50. Compression-Expansion (CE) Statistic
• A measure of the deviation of the local distribution of insert sizes from the global distribution of insert sizes
– 0 indicates no deviation
– ≤ −3 indicates much compression
– ≥ +3 indicates much expansion
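The slide describes the CE statistic only qualitatively. The sketch below assumes the commonly used z-score formulation, CE = (local mean − global mean) / (global SD / √N), where N is the number of inserts spanning the position of interest. That formula, the function names and the example numbers are assumptions made for illustration and are not taken from the slides.

```python
import math


def ce_statistic(local_inserts, global_mean, global_sd):
    """Z-score of the local mean insert size against the library-wide
    (global) insert size distribution. Assumed formulation, see lead-in."""
    n = len(local_inserts)
    if n == 0 or global_sd <= 0:
        raise ValueError("need at least one local insert and a positive SD")
    local_mean = sum(local_inserts) / n
    return (local_mean - global_mean) / (global_sd / math.sqrt(n))


def interpret(ce, cutoff=3.0):
    """Apply the +/-3 cut-offs quoted on the slide."""
    if ce <= -cutoff:
        return "compression"
    if ce >= cutoff:
        return "expansion"
    return "no strong deviation"


# Hypothetical library: mean insert 3000 bp, SD 300 bp.
# Sixteen mate pairs spanning one position average only 2700 bp.
ce = ce_statistic([2700] * 16, global_mean=3000, global_sd=300)
print(ce, interpret(ce))  # -4.0 compression
```

Because the denominator shrinks with √N, the same shift in local mean gives a more extreme CE value when more inserts span the position, which is one reason coverage matters when looking for these signatures.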
53. AMOSvalidate
• An assembly analysis pipeline to identify possible mis-assemblies
– Paired-end data
• CE stats
• Incorrect orientation
• Missing mate
– Coverage
– SNP density
– Singletons
57. Closing Remarks
• Software exists to allow manual editing of assemblies
– Time consuming
– Different tools have different features
– Most fall over with assemblies > ~15 Mbases or with many contigs/scaffolds (10k-100k)
65. Closing Remarks
• Ideal Tool
– Contig/scaffold viewer capable of displaying compressed/expanded mates, and which contigs mates map to when they fall off the current contig/scaffold (like SeqMan Pro and Hawkeye)
– Contig join editor for manual alignment and editing of contigs (like Gap5)
– Visualise clipped regions with consensus mismatches (like Gap5)
– Automated analysis of assembly to identify regions requiring attention (like AMOSvalidate) and a way to navigate to those regions for editing
– Minimise mouse-clicks and keyboard presses!!
67. Newbler Plant Genome Assemblies
• Pretty conservative in contig construction
• Seems to split out repetitive regions into their own contigs pretty well
• Heterozygosity issues
– SNP alignment issues
– Indels break contigs
– Hidden in clipped regions
– Manual joining of neighbouring contigs can reduce scaffolded contig numbers by 60-70%
– Many unscaffolded contigs have high sequence similarity to scaffolded contigs; could collapse these and reduce the number of unscaffolded contigs by 50%
The DNA shown doesn’t imply that we are doing mapping, but is shown to exemplify what the correct answer should be for the de novo assembly.
A repeat is: DNA with almost identical sequence. A repeat can arise through ...
Double coverage
Double coverage. Red line = consensus mismatches.
Complicated by higher levels of heterozygosity. If this gets high enough, assemblers tend to split the different alleles. This is further exacerbated by heterozygous indels.
Some genome assemblers can output the actual read alignments used in the generation of contigs. However, most do not. To generate an approximation to the alignment the assembler might have used, you need to map the reads back to the contig consensus sequences that the assembler generated. So for repeat-rich and highly heterozygous organisms it’s important to consider several alignments for any given read, as the distance constraint between it and its mate may not be satisfied by the best alignments for each of the reads independently.
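As a rough illustration of why several alignments per read should be considered, the sketch below picks the best-scoring combination of alignments for a read and its mate that also satisfies the library's distance constraint. The `Aln` record, the function names and the numbers are hypothetical. It ignores read orientation for brevity and is not how any particular read mapper implements pairing.

```python
from itertools import product
from typing import NamedTuple, Optional, Sequence, Tuple


class Aln(NamedTuple):
    contig: str
    start: int   # leftmost aligned base on the contig
    end: int     # rightmost aligned base on the contig
    score: int   # alignment score, higher is better


def span(a: Aln, b: Aln) -> int:
    """Outer distance covered by two alignments on the same contig."""
    return max(a.end, b.end) - min(a.start, b.start)


def best_consistent_pair(
    alns1: Sequence[Aln], alns2: Sequence[Aln], min_insert: int, max_insert: int
) -> Optional[Tuple[Aln, Aln]]:
    """Highest combined-score pair of alignments that lie on the same contig
    and whose span falls within [min_insert, max_insert]; None if none do."""
    best = None
    for a, b in product(alns1, alns2):
        if a.contig != b.contig:
            continue
        if not min_insert <= span(a, b) <= max_insert:
            continue
        if best is None or a.score + b.score > best[0].score + best[1].score:
            best = (a, b)
    return best


# Read 1 hits two copies of a repeat; its top-scoring hit is the wrong copy.
read1 = [Aln("contig1", 1000, 1100, 100), Aln("contig1", 5000, 5100, 98)]
read2 = [Aln("contig1", 5300, 5400, 100)]
chosen = best_consistent_pair(read1, read2, min_insert=300, max_insert=500)
print(chosen)  # the second read1 alignment is paired with read2
```

Taking each read's best hit independently would pair positions 1000 and 5300, a span of 4400 bp that looks like a badly stretched mate. The slightly lower-scoring alignment at 5000 gives a 400 bp span that satisfies the constraint.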
Mis-assembly signatures are determined by the ratio of consistent/inconsistent read pairs, so coverage is important in distinguishing mis-assemblies from random artefacts.
Incorrect orientation with varying stretched/shrunken/correct distance. Shrunken mates that span over the collapsed repeat.
Small collapsed repeats are only detectable by incorrect orientation, as the spanning mates may be within the expected size distribution of the mate library.
Links to a contig outside the current scaffold. Compressed mates due to collapsed repeat.
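The read pair signatures in these notes (missing mate, links to another contig, incorrect orientation, compressed or expanded mates) can be tallied with a small classifier. This is a minimal sketch: the `Pair` record, the category names and the 3 SD cut-off are assumptions made for illustration, and the expected orientation is taken to be inward-facing, which holds for typical paired-end libraries but not for every mate pair protocol.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Pair(NamedTuple):
    contig: str
    mate_contig: Optional[str]  # None when the mate is not in the assembly
    inward_facing: bool         # True when the two reads point at each other
    insert: int                 # outer distance between the two reads


def classify(pair, mean, sd, n_sd=3.0):
    """Assign one signature to a read pair (assumed cut-off: mean +/- n_sd * sd)."""
    if pair.mate_contig is None:
        return "missing mate"
    if pair.mate_contig != pair.contig:
        return "links to other contig"
    if not pair.inward_facing:
        return "incorrect orientation"
    if pair.insert < mean - n_sd * sd:
        return "compressed"
    if pair.insert > mean + n_sd * sd:
        return "expanded"
    return "consistent"


def inconsistent_fraction(pairs, mean, sd):
    """Share of pairs in a region that are not classified as consistent."""
    counts = Counter(classify(p, mean, sd) for p in pairs)
    total = sum(counts.values())
    return (total - counts["consistent"]) / total if total else 0.0


# Hypothetical pairs from one region; library mean 3000 bp, SD 300 bp.
region = [
    Pair("c1", "c1", True, 3000),   # consistent
    Pair("c1", "c1", True, 1500),   # compressed
    Pair("c1", "c1", False, 3000),  # incorrect orientation
    Pair("c1", "c7", True, 0),      # links to other contig
]
print([classify(p, 3000, 300) for p in region])
print(inconsistent_fraction(region, 3000, 300))  # 0.75
```

With only a handful of pairs, a fraction like this is easily dominated by random artefacts, which is why coverage matters when judging whether a region is genuinely mis-assembled.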