This talk was given on 2014-06-19 for the Genome Center’s Bioinformatics Core as part of a one-week workshop on using Galaxy. It covers the Assemblathon projects as well as other aspects of genome assembly.
A version of this talk is also available on Slideshare with embedded notes.
Note that this is an evolving talk; older and newer versions are also available on Slideshare.
44. Basic assembly metrics
Metric | Description
Assembly size | With or without very short contigs?
N50 / NG50 | For contigs and/or scaffolds
Coverage | When compared to a reference sequence
Errors | Base errors from alignment to reference sequence and/or input read data
Number of genes | From comparison to reference transcriptome and/or set of known genes
And many, many more...
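The N50 / NG50 row can be made concrete with a minimal sketch. It assumes the usual definitions: N50 is the length of the contig (or scaffold) at which the running total, taken from longest to shortest, first reaches half of the total assembly size, and NG50 uses half of an estimated genome size instead. The function name `n50` and its `genome_size` parameter are illustrative choices, not something taken from the talk.

```python
def n50(lengths, genome_size=None):
    """Return N50 of a list of sequence lengths.

    If genome_size (an estimated genome size in bp) is given, the
    threshold is half of that value instead, which gives NG50.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    target = total / 2
    running = 0
    # Walk from the longest sequence down, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    # The assembly covers less than half of genome_size: NG50 is undefined,
    # reported here as 0 (an assumption of this sketch).
    return 0


contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))                   # N50, based on assembly size
print(n50(contigs, genome_size=400))  # NG50, based on estimated genome size
```

Because NG50 is tied to one genome size estimate rather than to each assembly's own total, it is the fairer value when comparing assemblies of the same species that differ in size.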
53. Genome assembly: then
Genetic maps ✓
Physical maps ✓
Understanding of target genome ✓
Haploid / low heterozygosity genome ✓
Accurate & long reads ✓
Resources (time, money, people) ✓
54. So what was the result of spending millions of dollars
to assemble genomes of well-characterized species,
with accurate long reads, and detailed maps???
83. Comparing assemblers
✤ Can't fairly compare two assemblers if they:
  ✤ produced assemblies from different species
  ✤ assembled same species, but used sequence data from different sequencing technologies
  ✤ used same sequencing technologies but have different sequence libraries
✤ Even using different options for the same assembler may produce very different assemblies!
109. Goals
✤ Assess 'quality' of assemblies
✤ Define quality!
✤ Produce ranking of assemblies for each species
✤ Produce ranking of assemblers across species?
110. Who did what?
Person/group | Jobs
Me, Ian Korf, and Joseph Fass | Perform various analyses of all assemblies
David Schwartz et al. | Produce & evaluate optical maps
Jay Shendure et al. | Produce Fosmid sequences (bird & snake only)
Martin Hunt & Thomas Otto | Perform REAPR analysis
Dent Earl & Benedict Paten | Help with meta-analysis of final rankings
128. 3) Gene-sized scaffolds
✤ Some assembly folks get a little obsessed by length!
✤ How long is 'long enough' for a scaffold?
✤ What if you just wanted to find genes?
✤ Average vertebrate gene = ~25 Kbp
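One way to turn the "gene-sized" idea into a number is sketched below. The slides do not say exactly how the metric was calculated, so this assumes it is the share of an estimated genome size that sits in scaffolds at least as long as an average vertebrate gene (~25 Kbp). The function name, its parameters, and the example lengths are all hypothetical.

```python
def gene_sized_fraction(scaffold_lengths, genome_size, min_length=25_000):
    """Fraction of genome_size contained in scaffolds >= min_length bp.

    min_length defaults to 25 Kbp, the approximate average vertebrate
    gene length quoted on the slide.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be a positive number of bp")
    gene_sized_total = sum(
        length for length in scaffold_lengths if length >= min_length
    )
    return gene_sized_total / genome_size


# Made-up scaffold lengths, purely for illustration.
scaffolds = [100_000, 30_000, 24_999, 5_000]
print(gene_sized_fraction(scaffolds, genome_size=200_000))
```

A measure like this rewards assemblies in which most of the sequence is long enough to hold a typical gene, without rewarding extra length beyond that point.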
134. 4) Core genes
✤ Used CEGMA (Core Eukaryotic Gene Mapping Approach)
✤ CEGMA uses a set of 458 'Core Eukaryotic Genes' (CEGs)
✤ CEGs are conserved in: S. cerevisiae, S. pombe, A. thaliana, C. elegans, D. melanogaster, and H. sapiens
✤ How many full-length CEGs are in each assembly?
144. 8 & 9) Optical maps
✤ Stretch out DNA
✤ Cut with restriction enzymes
✤ Note lengths of fragments
✤ Compare to in silico digest of scaffolds
✤ Not all scaffolds suitable for analysis
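The "in silico digest" step can be sketched as follows. This is a simplified illustration, not the pipeline used for the optical map analysis: it searches one strand for an exact recognition site and reports the resulting fragment lengths. EcoRI (G^AATTC) is used only as a familiar example enzyme; the enzyme actually used for the maps is not stated in the slides, and `in_silico_digest` is a hypothetical name.

```python
def in_silico_digest(sequence, site="GAATTC", cut_offset=1):
    """Return fragment lengths from cutting sequence at every site.

    site       -- recognition sequence (EcoRI here, as an example only)
    cut_offset -- cut position within the site (1 = after the first base)

    Only the given strand is searched, which is sufficient for a
    palindromic site such as GAATTC.
    """
    seq = sequence.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    # Fragment boundaries: sequence start, each cut, sequence end.
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


scaffold = "AAAA" + "GAATTC" + "CCCCC" + "GAATTC" + "TT"
print(in_silico_digest(scaffold))
```

The ordered list of predicted fragment lengths is what gets compared with the fragment lengths measured on the optical map. A scaffold with too few cut sites yields too few fragments to place reliably, which is one reason a scaffold may be unsuitable for this analysis.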
145. 8 & 9) Optical maps
Image from University of Wisconsin-Madison
165. Some conclusions
✤ Very hard to find assemblers that performed well across all 10 key metrics
✤ Assemblers that perform well in one species do not always perform as well in another
✤ Bird & snake assemblies appear better than fish
✤ No real 'winner' for bird and fish
186. The choice of one command-line option,
used by one tool in the calculation of one key metric...
...probably made enough difference to drop
the PacBio-containing assembly to 2nd place.
187. Other conclusions
✤ Different metrics tell different stories
✤ Heterozygosity was a big issue for bird & fish assemblies
✤ Final rankings very sensitive to changes in metrics
✤ N50 is a semi-useful predictor of assembly quality
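To see why a final ranking can be so sensitive to the choice of metrics, here is a small sketch that combines metrics by summing per-metric z-scores and then re-ranks after dropping one metric. The scoring scheme is a simplification assumed for illustration, and the assembly names and numbers in `toy_scores` are invented; they are not Assemblathon results.

```python
from statistics import mean, pstdev


def rank_by_z_scores(scores, metrics):
    """Rank assemblies by the sum of their z-scores over the chosen metrics.

    scores  -- {assembly_name: {metric_name: value}}, higher is better
    metrics -- the metric names to include in the ranking
    """
    totals = {name: 0.0 for name in scores}
    for metric in metrics:
        values = [scores[name][metric] for name in scores]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # every assembly ties on this metric
        for name in scores:
            totals[name] += (scores[name][metric] - mu) / sigma
    return sorted(totals, key=totals.get, reverse=True)


# Invented values for three hypothetical assemblies.
toy_scores = {
    "A": {"n50": 10, "genes": 90},
    "B": {"n50": 4, "genes": 99},
    "C": {"n50": 1, "genes": 95},
}

print(rank_by_z_scores(toy_scores, ["n50", "genes"]))
print(rank_by_z_scores(toy_scores, ["n50"]))
print(rank_by_z_scores(toy_scores, ["genes"]))
```

In this toy case the top-ranked assembly changes depending on whether both metrics or only one of them is counted, which is the kind of instability the slide describes.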
192. Inter-specific differences matter
✤ The three species have genomes with different properties
  ✤ repeats
  ✤ heterozygosity
✤ The three genomes had very different NGS data sets
  ✤ Only bird had PacBio & 454 data
  ✤ Different insert sizes in short-insert libraries
205. A wish list for Assemblathon 3
✤ Only have 1 species
✤ Teams have to 'buy' resources using virtual budgets
✤ Factor in CPU time/cost?
✤ Agree on metrics before evaluating assemblies!
✤ Encourage experimental assemblies
✤ Use new FASTG genome assembly file format
✤ Get someone else to write the paper!
217. NGS madness
Next generation sequencing
aka second generation sequencing
but there’s also: third generation sequencing
fourth generation sequencing
next-next generation sequencing
next-next-next generation sequencing
219. NGS madness
Technology | According to some papers… | According to other papers…
Complete Genomics | 2nd generation | 3rd generation
Ion Torrent | 2nd generation | 3rd generation
PacBio | 2nd generation | 3rd generation
Oxford Nanopore | 3rd generation | 4th generation
220. NGS madness
“PacBio is a 2.5th generation”
“Helicos lies between the transition of next-generation to third generation”
223. NGS madness
There are different sequencing methodologies, and there are different sequencing platforms.
Use one or the other.
Or just say ‘current sequencing technologies’.
266. The future of genome assembly
✤ At some point we will look back with embarrassment at this era.
✤ Assembly must, and will, get better, but...
✤ ...'perfect' genomes may remain elusive.
✤ Data management will remain an issue:
✤ the human genome -> human genomes -> tissue-specific genomes
272. Summary
✤ There is no real consensus on how to make a good genome assembly
✤ Try different assemblers, try different command-line options
✤ Decide what it is you want to get out of a genome assembly
✤ Look at your input and output data
✤ Wait 5 years and come back, we’ll (probably) have solved everything!
273. Resources
✤ Lex Nederbragt’s blog - https://flxlexblog.wordpress.com
✤ Nick Loman’s blog - http://pathogenomics.bham.ac.uk/blog/
✤ Assemblathon twitter feed - https://twitter.com/assemblathon