A retrospective look at the state of many famous modern genome sequences, and a cautionary tale of the dangers of assuming that a genome sequence and/or its annotations are finished.
2. Part 1 - the sequence
We can think of ‘genome completion’ as referring to the sequence and/or the set of gene
annotations. Let’s start with the sequence.
3. A brief history of genomics
1971 - Wu & Taylor determine the first ever DNA sequence (all 12 bp of it!)
1977 - Sanger et al. sequence the first ever (DNA-based) virus genome - 5,375 bp
1995 - First complete bacterial genome sequence (Haemophilus influenzae) - 1.83 Mb
1996 - First complete eukaryotic genome (Saccharomyces cerevisiae) - 12 Mb
1998 - First animal genome (Caenorhabditis elegans) - 100 Mb
It took 18 years after the structure of DNA was known before anyone could sequence it. The first DNA sequence was from the end of
bacteriophage lambda (written up in a 20-page paper). The first genome was actually an RNA viral genome, determined in 1976 by Fiers
et al. The 1980s and 1990s saw the start of widespread DNA sequencing for genes of interest in species of interest. Moving to
eukaryotic genome sequencing means determining multiple chromosomes and tackling bigger repeats (more assembly problems).
4. genomesonline.org
[Bar chart: Complete vs. Incomplete genome projects (bar labels: 3,077 and 7,732), broken down into Bacteria, Archaea, and Eukaryotes; y-axis 0 to 6,000]
Genomesonline.org tries to track all of the major genome projects out there. A lot of them
are flagged as incomplete, and maybe some of those will never reach ‘completion’ status.
5. CAP criteria
1) Complete
2) Accessible
3) Permanent
Sydney Brenner
The great biologist may have won a Nobel prize for his work on development, he may have
postulated the very existence of mRNA, and he may have co-discovered the triplet code ...
but he also came up with the CAP criteria.
These criteria could pertain to any large-scale academic project, but they were conceived with
reference to genome sequencing projects.
6. Homo sapiens
2000 - ‘working draft’ announced
2001 - ‘working draft’ published
2003 - ‘Finished’ version announced
2006 - Last chromosome finished
So it’s finished now right?
Ns make up ~9% of current genome
The human genome has been finished on several different dates, depending on how you define
‘finished’. Ns (unknown bases) still account for 9% of the 3.1 Gbp genome.
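A figure like ‘Ns make up ~9%’ can be checked on any genome you download. The following is a minimal sketch, assuming the genome is a plain FASTA file; the function name `n_fraction` and the toy sequence are illustrative and not part of the original talk.

```python
import io

def n_fraction(fasta_handle):
    """Return (n_count, total_bases) over all sequences in a FASTA stream."""
    n_count = 0
    total = 0
    for line in fasta_handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA header lines
        total += len(line)
        n_count += line.upper().count("N")  # count both 'N' and soft-masked 'n'
    return n_count, total

# Toy input with a made-up sequence; a real run would pass open("genome.fa")
toy = io.StringIO(">chr_toy\nACGTNNNNAC\nGTnnACGTAC\n")
n, total = n_fraction(toy)
print(f"{n} of {total} bases are N ({100 * n / total:.1f}%)")
# prints: 6 of 20 bases are N (30.0%)
```

Note that this only counts unknown bases inside the downloaded sequence; parts of the genome that are missing from the assembly altogether are not visible this way.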
7. Drosophila melanogaster
2000 - genome published
~175 Mb genome
So it’s finished now right?
Ns make up ~4% of current genome
Drosophila is a much smaller genome, but a third of the genome is represented by the
harder-to-sequence heterochromatin. This was the subject of a separate genome project that
didn’t finish until 2007.
The genome still has many Ns.
8. Arabidopsis thaliana
Published 2000
115 Mb sequenced,
125 Mb genome
As of 2007...
119 Mb sequenced,
157 Mb genome
As of 2012...
119 Mb sequenced,
135 Mb genome
Ns make up ~0.2% of current genome
Many published genome sizes are based on estimates, which can be wrong. As more and more of the Arabidopsis genome was
sequenced, the estimate of its size had to be revised. So between 2000 and 2007 they produced more sequence
but, paradoxically, the genome became less complete.
This illustrates the difficulty of estimating genome size. The latest figures suggest that the genome is smaller again. Note
that much of this missing genome is not present as Ns in the sequence you download. But the part you can download
still has many unknown bases.
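The ‘more sequence but less complete’ paradox is easy to see once the slide's numbers are turned into percentages. This is a small worked example using only the figures quoted on the Arabidopsis slide; the helper name `percent_sequenced` is made up for illustration.

```python
# (year, Mb sequenced, estimated genome size in Mb), as quoted on the slide
estimates = [
    (2000, 115, 125),
    (2007, 119, 157),
    (2012, 119, 135),
]

def percent_sequenced(sequenced_mb, genome_mb):
    """Percentage of the estimated genome that has been sequenced."""
    return 100 * sequenced_mb / genome_mb

for year, sequenced_mb, genome_mb in estimates:
    pct = percent_sequenced(sequenced_mb, genome_mb)
    print(f"{year}: {sequenced_mb} of {genome_mb} Mb = {pct:.0f}% sequenced")
# prints:
# 2000: 115 of 125 Mb = 92% sequenced
# 2007: 119 of 157 Mb = 76% sequenced
# 2012: 119 of 135 Mb = 88% sequenced
```

The amount of sequence barely changes; it is the denominator (the estimated genome size) that moves the completeness figure.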
9. Caenorhabditis elegans
1998 - ‘finished’ genome published
100 Mb genome (revised from 97 Mb)
2002 - last gap closed
Genome information for species such as C. elegans is curated by model organism databases (MODs), which ensure that
the work goes on long after the initial publication announcing a ‘finished’ genome.
Genome size was revised from 97 Mb to 100 Mb not long after publication.
10. Where’s my gene???
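[Figure: one end of C. elegans chromosome X, with regions labelled by the year they were finished: 2002, 2001, 2000, 1997]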
People will often know that their gene of interest is definitely present in a genome through traditional genetic experiments... however, it
might not be present in the published genome sequence. The figure shows the years in which regions at one end of chromosome X of C.
elegans were finished. The last 20 kbp region wasn’t finished until four years after the genome was published in 1998. This region
contained predicted genes... maybe scientists were working on these genes while waiting for the sequence.
11. Caenorhabditis elegans
1998 - ‘finished’ genome published
100 Mb genome (revised from 97 Mb)
2002 - last gap closed
2004 - last N removed
So it’s finished now right?
Unlike the previous genomes, C. elegans has no Ns (but this took 6 years after publication to achieve).
12. Worm genome progress
[Line chart: genome size (bp), 0 to 100,000,000, against date, Jan-91 to May-06]
At a gross level, it looks like the worm genome did not change much after the year 2000....
13. Worm genome progress
[Line chart: genome size (bp), 100,220,000 to 100,280,000, against date, Sep-01 to Oct-05; annotation: ‘66 nt added, May 2010’]
Here is a zoom-in on the years 2001–2005... still lots of sequence changes happening. The last change on this graph represents a
very small addition of 66 bp to the genome. Maybe this change will not make any difference to anyone in the world, but it still makes
the genome sequence more accurate and closer to the biological truth.
Not many genome projects are this devoted!
14. Saccharomyces cerevisiae
Published 1997
12 Mb genome
No gaps, no Ns
So it’s finished now right?
1,653 genome changes made since 1997
Last change made in February 2011
Like C. elegans, yeast is a species which benefits from coordinated efforts to finish the genome.
In February 2011, the yeast genome sequence underwent corrections that affected 194 proteins. This happened in a – by
today’s standards – tiny genome which has been studied and curated for 15 years! What hope for larger, more complex
genomes?
15. Part 2 - annotations
Maybe you don’t care about the state of the genome, as long as you have all of the genes
present.
16. C. elegans annotations
Genes Proteins
[Line chart: numbers of Genes and Proteins (y-axis 19,000 to 25,000) by year, 1998 and 2003 to 2011; 1998 is marked ‘Genome publication’]
Since publication, the number of protein-coding loci in C. elegans has risen by about 1,500
genes. But the number of proteins that might arise from alternatively spliced products is
much, much higher and shows no signs of slowing down.
17. C. elegans annotations
Genes Proteins RNA genes
[Line chart: numbers of Genes, Proteins, and RNA genes (y-axis 0 to 25,000) by year, 1998 and 2003 to 2011; 1998 is marked ‘Genome publication’]
When we consider RNA genes, it is surprising that there are now more RNA genes than
protein-coding genes. How many more species have similar secrets in their genomes that
have yet to be discovered, mostly because of our historical focus on protein-coding genes?
18. Core genes
You can identify ‘core’ genes, that are highly conserved
and that should be present in all species
Our group identified a set of 458 core genes from 6
reference genomes:
Homo sapiens
Caenorhabditis elegans
Drosophila melanogaster
Arabidopsis thaliana
Saccharomyces cerevisiae
Schizosaccharomyces pombe
We can then test whether these are all present in any
‘finished’ genome.
Our lab developed a set of 458 ‘core genes’ that we believe should be present in every
(complete) eukaryotic genome.
In the past we’ve discovered that many published genomes are missing some of these genes
from the genome sequence, even though they should be there. For example, the chicken genome is missing core
genes even though those genes are represented by chicken EST sequences.
19. Ciona intestinalis
Version   N50         Core genes
v1.95     234,500     444
v2.0      2,571,800   425
Sometimes genomes get updates and assemblies are given a new version number. This might
be associated with an increase in scaffold size (as measured by N50), but sometimes the number of core
genes found goes down.
20. Caenorhabditis sp. PS1010
Version   N50       Core genes
v4        9,446     454
v5        64,074    428
People can easily measure things like N50; it is harder to measure things like which genes are
present (though people can use our free CEGMA tool!).
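To show just how easy N50 is to measure, here is a minimal sketch using the usual definition: sort the contig or scaffold lengths from longest to shortest and report the length at which the running total first reaches half of the assembly. The function name `n50` and the toy lengths are illustrative only.

```python
def n50(lengths):
    """Length L such that sequences of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached half of the assembly
            return length
    raise ValueError("no sequence lengths given")

# Toy scaffold lengths (made up): total is 300, and 80 + 70 = 150 reaches half
print(n50([80, 70, 50, 40, 30, 20, 10]))
# prints: 70
```

Nothing in this calculation looks at gene content, which is why an assembly can improve its N50 while losing core genes.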
21. S. cerevisiae
Changes due to genome sequence changes in Feb 2011
caused changes to 194 protein sequences.
Last correction to gene structure due to mis-annotation
was in Jan 2010
So just 13 years to produce a
stable gene set!
Even in a simpler genome, the work of annotation goes on.
Bear in mind that many model organism databases split genes into different categories
based on evidence.
23. ‘Finished’ eukaryotic genome
sequences are not finished!
except maybe yeast
Not that this matters necessarily. 1% of a genome is better than no genome at all. At some
level, the law of diminishing returns sets in. Ideally, we could produce a metric of ‘useful
papers published per person-hour of database curator working on model organism
database’.
Just be aware that the genome you download today may change in future and your results
might not always be easily reproducible by someone using a different version.
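One lightweight way to guard against this is to record a checksum of the exact genome file alongside your results, so that someone else can tell whether they are using the same version. This is only a sketch of that idea, not something from the original talk; the function name `file_sha256` and the toy file are assumptions made for illustration.

```python
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to handle large genomes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a tiny made-up FASTA file; a real run would point at the downloaded genome
with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as tmp:
    tmp.write(b">chr_toy\nACGT\n")
    toy_path = tmp.name

checksum = file_sha256(toy_path)
os.remove(toy_path)
print(f"genome file checksum: {checksum}")
```

Reporting the checksum together with the database release or assembly version number makes it much easier to spot when two analyses were run on different sequences.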
24. CAP criteria
1) Complete
2) Accessible
3) Permanent
Sydney Brenner
Clearly they are not all complete.
As for accessibility, it is not always easy to get hold of large datasets. Bandwidth represents a
particular problem (it can be almost impossible to download GenBank from east coast to
west coast using FTP). Also, online journals often end up breaking links to
supplemental material.
For the most part, they are permanent, though the raw, unassembled read data is not always kept.