When is a genome finished?

Keith Bradnam

These slides and notes are licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.
A talk given to the UC Davis Bits & Bites club, based on an earlier lecture I had given at UC Davis.

Keith Bradnam, March 2011
Part 1 - the sequence




We can think of ‘genome completion’ as referring to the sequence and/or the set of gene
annotations. Let’s start with the sequence.
A brief history of genomics
               1971 - Wu & Taylor determine the first ever DNA sequence (all 12 bp of it!)

               1977 - Sanger et al. sequence the first ever (DNA-based) virus genome - 5,375 bp

               1995 - First complete bacterial genome sequence (Haemophilus influenzae) - 1.83 Mb

               1996 - First complete eukaryotic genome (Saccharomyces cerevisiae) - 12 Mb

               1998 - First animal genome (Caenorhabditis elegans) - 100 Mb

It took 18 years after the structure of DNA was worked out before anyone could sequence it. The first DNA sequence came from the end of
a bacteriophage lambda genome (reported in a 20-page paper). The first genome was actually an RNA viral genome, determined in 1975 by Fiers
et al. The 1980s and 1990s saw the start of widespread DNA sequencing for genes of interest in species of interest. Moving to
eukaryotic genome sequencing meant determining multiple chromosomes and tackling bigger repeats (more assembly problems).
genomesonline.org
[Bar chart: genome projects listed as Complete (3,077) versus Incomplete (7,732), split into Bacteria, Archaea, and Eukaryotes]

Genomesonline.org tries to track all of the major genome projects out there. A lot of them
are flagged as incomplete, and maybe some of those will never reach ‘completion’ status.
CAP criteria
   1) Complete
   2) Accessible
   3) Permanent




                                                              Sydney Brenner
The great biologist may have won a Nobel prize for his work on development, he may have
postulated the very existence of mRNA, and he may have co-discovered the triplet code ...
but he also came up with the CAP criteria.

These criteria could pertain to any large-scale academic project, but they were conceived with
reference to genome sequencing projects.
Homo sapiens

                                     2000 - ‘working draft’ announced

                                     2001 - ‘working draft’ published

                                     2003 - ‘Finished’ version announced

                                     2006 - Last chromosome finished

                                     So it’s finished now right?

                                     Ns make up ~9% of current genome


The human genome has been finished on several different dates, depending on how you define
‘finished’. Ns – unknown bases – still account for 9% of the 3.1 Gbp genome.
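To make the idea of 'Ns as a percentage of the genome' concrete, here is a minimal sketch of how that fraction could be measured from a FASTA file. The function name `n_fraction` and the toy sequence are invented for illustration; this is not how the 9% figure above was calculated.

```python
from io import StringIO


def n_fraction(fasta_lines):
    """Return the fraction of bases that are N (or n) in FASTA-formatted lines."""
    total = 0
    unknown = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        total += len(line)
        unknown += line.upper().count("N")
    return unknown / total if total else 0.0


# Toy example (made-up sequence, not real genome data)
toy_fasta = StringIO(">chr_example\nACGTNNNNAC\nGTACGTACGT\n")
print(f"{n_fraction(toy_fasta):.1%}")
```

For a real genome you would pass an open file handle instead of the `StringIO` object. Note that this only counts Ns present in the file; sequence that is missing altogether is invisible to it.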
Drosophila melanogaster



                                   2000 - genome published

                                   ~175 Mb genome
                                   So it’s finished now right?
                                   Ns make up ~4% of current genome




Drosophila has a much smaller genome, but a third of it is represented by the
harder-to-sequence heterochromatin. This was the subject of a separate genome project that
didn’t finish until 2007.

The genome still has many Ns.
Arabidopsis thaliana

                                                   Published 2000

                                                   115 Mb sequenced,
                                                   125 Mb genome

                                                   As of 2007...
                                                   119 Mb sequenced,
                                                   157 Mb genome

                                                   As of 2012...
                                                   119 Mb sequenced,
                                                   135 Mb genome
                                                   Ns make up ~0.2% of current genome
Published genome sizes are often based on estimates, which can be wrong. As more and more of the Arabidopsis genome
was sequenced, the estimate of its size had to be revised. So between 2000 and 2007 more sequence was produced,
but paradoxically the genome became less complete.

This illustrates the difficulty of estimating genome size. The latest figures suggest that the genome is smaller again. Note
that much of this missing genome is not present as Ns in the sequence you download. But the part you can download
still has many unknown bases.
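The paradox is easier to see as arithmetic. This short sketch uses only the figures from the slide (Mb sequenced and estimated genome size for each year); the function name `percent_complete` is made up for illustration.

```python
# Figures from the Arabidopsis slide: (year, Mb sequenced, estimated genome size in Mb)
estimates = [
    (2000, 115, 125),
    (2007, 119, 157),
    (2012, 119, 135),
]


def percent_complete(sequenced_mb, genome_mb):
    """Percentage of the estimated genome that has been sequenced."""
    return 100 * sequenced_mb / genome_mb


for year, sequenced_mb, genome_mb in estimates:
    print(f"{year}: {percent_complete(sequenced_mb, genome_mb):.1f}% of estimated genome sequenced")
```

On these numbers the genome goes from 92.0% complete in 2000 to 75.8% in 2007, then back up to 88.1% in 2012, even though the amount of sequence never went down.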
Caenorhabditis elegans

                                                  1998 - ‘finished’ genome published

                                                  100 Mb genome (revised from 97 Mb)

                                                  2002 - last gap closed




Genome information for species such as C. elegans is curated by model organism databases (MODs), which ensure that
the work goes on long after the initial publication announcing a ‘finished’ genome.

Genome size was revised from 97 Mb to 100 Mb not long after publication.
Where’s my gene???




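[Figure: one end of C. elegans chromosome X, with regions labelled by the year each was finished: 2002, 2001, 2000, 1997]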




People will often know from traditional genetic experiments that their gene of interest is definitely present in a genome. However, it
might not be present in the published genome sequence. The figure shows when different regions at one end of chromosome X of C.
elegans were finished. The last 20 kbp region wasn’t finished until four years after the genome was published in 1998. This region
contained predicted genes, so scientists may have been working on these genes while waiting for the sequence.
Caenorhabditis elegans

                                                    1998 - ‘finished’ genome published

                                                    100 Mb genome (revised from 97 Mb)

                                                    2002 - last gap closed

                                                    2004 - last N removed
                                                    So it’s finished now right?



Unlike the previous genomes, C. elegans has no Ns (but this took 6 years after publication to achieve).
Worm genome progress

[Line chart: genome size (bp), from 0 to 100,000,000, plotted against date, Jan-91 to May-06]




At a gross level, it looks like the worm genome did not change much after the year 2000....
Worm genome progress
[Line chart: genome size (bp), from 100,220,000 to 100,280,000, plotted against date, Sep-01 to Oct-05. Annotation: 66 nt added, May 2010]



Here is a zoom in of the years 2001–2005...still lots of sequence changes happening. The last change on this graph represents a
very small addition of 66 bp to the genome. Maybe this change will not make any difference to anyone in the world, but it still makes
the genome sequence more accurate and closer to the biological truth.

Not many genome projects are this devoted!
Saccharomyces cerevisiae

                                                 Published 1997

                                                 12 Mb genome

                                                 No gaps, no Ns

                                                 So it’s finished now right?

                                                 1,653 genome changes made since 1997

                                                 Last change made in February 2011


Like C. elegans, yeast is a species which benefits from coordinated efforts to finish the genome.

In February 2011, the yeast genome sequence underwent corrections that affected 194 proteins. This happened in a – by
today’s standards – tiny genome which has been studied and curated for 15 years! What hope for larger, more complex
genomes?
Part 2 - annotations




Maybe you don’t care about the state of the genome, as long as you have all of the genes
present.
C. elegans annotations
[Chart: number of Genes and Proteins (y-axis 19,000 to 25,000) at the genome publication in 1998 and then each year from 2003 to 2011]
Since publication, the number of protein-coding loci in C. elegans has risen by about 1,500
genes. But the number of proteins that might arise from alternatively spliced products is
much, much higher and shows no signs of slowing down.
C. elegans annotations
[Chart: number of Genes, Proteins, and RNA genes (y-axis 0 to 25,000) at the genome publication in 1998 and then each year from 2003 to 2011]
When we consider RNA genes, it is surprising that there are now more RNA genes than
protein-coding genes. How many more species have similar secrets in their genomes that
have yet to be discovered, mostly because of our historical focus on protein-coding genes?
Core genes
            You can identify ‘core’ genes, that are highly conserved
            and that should be present in all species

            Our group identified a set of 458 core genes from 6
            reference genomes:
              Homo sapiens
              Caenorhabditis elegans
              Drosophila melanogaster
              Arabidopsis thaliana
              Saccharomyces cerevisiae
              Schizosaccharomyces pombe

            We can then test whether these are all present in any
            ‘finished’ genome.
Our lab developed a set of 458 ‘core genes’ that we believe should be present in every
(complete) eukaryotic genome.

In the past we’ve discovered that many published genomes are missing some of these genes
from the genome sequence, even though they should be there. E.g. chicken has missing core
genes even though those genes are represented by chicken EST sequences.
Ciona intestinalis


              Version      N50          Core genes
              v1.95        234,500      444
              v2.0         2,571,800    425



Sometimes genomes get updates and assemblies are given a new version number. This might
be associated with an increase in average scaffold size, but sometimes the number of core
genes gets reduced.
Caenorhabditis sp. PS1010


               Version      N50          Core genes
               v4           9,446        454
               v5           64,074       428



It is easy to measure things like N50, but harder to measure which genes are
present (though people can use our free CEGMA tool!)
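As a reminder of why N50 is the easy part, here is a minimal sketch of the calculation using the common definition: sort the sequences from longest to shortest and report the length at which the running total reaches half of the assembly. The function name and the toy lengths are made up for illustration; they are not taken from the Ciona or Caenorhabditis assemblies above.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths (made up for illustration)
print(n50([100, 80, 50, 30, 20, 10, 10]))
```

Nothing in this calculation looks at sequence content, which is how an assembly can gain a much larger N50 while losing core genes.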
S. cerevisiae
         Genome sequence corrections in Feb 2011
         changed 194 protein sequences.

         Last correction to gene structure due to mis-annotation
         was in Jan 2010

         So just 13 years to produce a
         stable gene set!




Even in a simpler genome, the work of annotation goes on.

Bear in mind that many model organism databases often split genes into different categories
based on evidence.
Conclusions
‘Finished’ eukaryotic genome
               sequences are not finished!

                               except maybe yeast




Not that this matters necessarily. 1% of a genome is better than no genome at all. At some
level, the law of diminishing returns sets in. Ideally, we could produce a metric of ‘useful
papers published per person-hour of database curator working on model organism
database’.

Just be aware that the genome you download today may change in future and your results
might not always be easily reproducible by someone using a different version.
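One modest way to guard against this is to record exactly which file you analysed. The sketch below is only a suggestion, not something any of the genome databases require: it computes a SHA-256 checksum of a downloaded genome file so that it can be saved with your results. The function names, the file name, and the release label are all hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to cope with large genomes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_genome_file(path, release_label):
    """Build a one-line record (release label, path, checksum) to save alongside results."""
    return f"{release_label}\t{path}\tsha256={sha256_of_file(path)}"
```

For example, `describe_genome_file("genome.fa", "hypothetical_release_1")` gives a line you could paste into a lab notebook or README, so that someone repeating the analysis can tell whether they have the same sequence.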
CAP criteria
   1) Complete
   2) Accessible
   3) Permanent




                                                               Sydney Brenner
Clearly, genomes are not all complete.

As for accessibility, it is not always easy to get hold of large datasets. Bandwidth represents a
particular problem (it can be almost impossible to download GenBank from east coast to
west coast using FTP). Also, online journals often end up breaking links to
supplemental material.

For the most part, genomes are permanent, though the raw, unassembled read data is not always kept.
The End

Weitere ähnliche Inhalte

Was ist angesagt?

Whole genome sequencing of arabidopsis thaliana
Whole genome sequencing of arabidopsis thalianaWhole genome sequencing of arabidopsis thaliana
Whole genome sequencing of arabidopsis thalianaBhavya Sree
 
The Human Genome Project - Part III
The Human Genome Project - Part IIIThe Human Genome Project - Part III
The Human Genome Project - Part IIIhhalhaddad
 
The language of life (all the subtitles)first ppt 2 bimester
The language of life (all the subtitles)first ppt 2 bimesterThe language of life (all the subtitles)first ppt 2 bimester
The language of life (all the subtitles)first ppt 2 bimesterSofia Paz
 
Genomics and proteomics (Bioinformatics)
Genomics and proteomics (Bioinformatics)Genomics and proteomics (Bioinformatics)
Genomics and proteomics (Bioinformatics)Sijo A
 
The Human Genome Project - Part I
The Human Genome Project - Part IThe Human Genome Project - Part I
The Human Genome Project - Part Ihhalhaddad
 
Application of genomics in animals
Application of genomics in animalsApplication of genomics in animals
Application of genomics in animalsUsman Arshad
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomicskiran singh
 
Plant genomics general overview
Plant genomics general overviewPlant genomics general overview
Plant genomics general overviewKAUSHAL SAHU
 
Genome sequencing
Genome sequencingGenome sequencing
Genome sequencingShital Pal
 
Bio153 microbial genomics 2012
Bio153 microbial genomics 2012Bio153 microbial genomics 2012
Bio153 microbial genomics 2012Mark Pallen
 
Introduction to genomes
Introduction to genomesIntroduction to genomes
Introduction to genomesavrilcoghlan
 

Was ist angesagt? (20)

Whole genome sequencing of arabidopsis thaliana
Whole genome sequencing of arabidopsis thalianaWhole genome sequencing of arabidopsis thaliana
Whole genome sequencing of arabidopsis thaliana
 
Yeast Genome
Yeast Genome Yeast Genome
Yeast Genome
 
The Human Genome Project - Part III
The Human Genome Project - Part IIIThe Human Genome Project - Part III
The Human Genome Project - Part III
 
Human Genome Project
Human Genome ProjectHuman Genome Project
Human Genome Project
 
The language of life (all the subtitles)first ppt 2 bimester
The language of life (all the subtitles)first ppt 2 bimesterThe language of life (all the subtitles)first ppt 2 bimester
The language of life (all the subtitles)first ppt 2 bimester
 
Genomics seminar copy
Genomics seminar   copyGenomics seminar   copy
Genomics seminar copy
 
Genomics
GenomicsGenomics
Genomics
 
Genomics and proteomics (Bioinformatics)
Genomics and proteomics (Bioinformatics)Genomics and proteomics (Bioinformatics)
Genomics and proteomics (Bioinformatics)
 
The Human Genome Project - Part I
The Human Genome Project - Part IThe Human Genome Project - Part I
The Human Genome Project - Part I
 
Application of genomics in animals
Application of genomics in animalsApplication of genomics in animals
Application of genomics in animals
 
Plant genome project(aribidopsis)
Plant genome project(aribidopsis)Plant genome project(aribidopsis)
Plant genome project(aribidopsis)
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Plant genomics general overview
Plant genomics general overviewPlant genomics general overview
Plant genomics general overview
 
Genome analysis
Genome analysisGenome analysis
Genome analysis
 
Human Genome
Human Genome Human Genome
Human Genome
 
Genome sequencing
Genome sequencingGenome sequencing
Genome sequencing
 
Bio153 microbial genomics 2012
Bio153 microbial genomics 2012Bio153 microbial genomics 2012
Bio153 microbial genomics 2012
 
Introduction to genomes
Introduction to genomesIntroduction to genomes
Introduction to genomes
 
Types of genomics ppt
Types of genomics pptTypes of genomics ppt
Types of genomics ppt
 
THE human genome
THE human genomeTHE human genome
THE human genome
 

Ähnlich wie When is a genome finished?

Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Keith Bradnam
 
Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...Keith Bradnam
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-researchc.titus.brown
 
Lets Make a Mammoth
Lets Make a Mammoth  Lets Make a Mammoth
Lets Make a Mammoth Cheche Salas
 
Clase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdfClase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdfNoraCRuizGuevara
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotesc.titus.brown
 
Overview on arabidopsis and rice genome
Overview on arabidopsis and rice genomeOverview on arabidopsis and rice genome
Overview on arabidopsis and rice genomeGopal Singh
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011c.titus.brown
 
Sci cafe humangenome&health
Sci cafe humangenome&healthSci cafe humangenome&health
Sci cafe humangenome&healthToby Rossman
 
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...Copenhagenomics
 
Credit seminar on rice genomics crrected
Credit seminar on rice genomics crrectedCredit seminar on rice genomics crrected
Credit seminar on rice genomics crrectedVarsha Gayatonde
 
Marzillier_09052014.pdf
Marzillier_09052014.pdfMarzillier_09052014.pdf
Marzillier_09052014.pdf7006ASWATHIRR
 
Story_of_HGP_As_Told_by_Front-line_Participant.pdf
Story_of_HGP_As_Told_by_Front-line_Participant.pdfStory_of_HGP_As_Told_by_Front-line_Participant.pdf
Story_of_HGP_As_Told_by_Front-line_Participant.pdfravindrasingh203141
 
Molecularbiology 090516221322-phpapp01
Molecularbiology 090516221322-phpapp01Molecularbiology 090516221322-phpapp01
Molecularbiology 090516221322-phpapp01Slindile Nyathi
 

Ähnlich wie When is a genome finished? (20)

2013 duke-talk
2013 duke-talk2013 duke-talk
2013 duke-talk
 
Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...
 
Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...
 
2012 oslo-talk
2012 oslo-talk2012 oslo-talk
2012 oslo-talk
 
Genomic Data Analysis
Genomic Data AnalysisGenomic Data Analysis
Genomic Data Analysis
 
Sweden_eemis_big_data
Sweden_eemis_big_dataSweden_eemis_big_data
Sweden_eemis_big_data
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-research
 
Lets Make a Mammoth
Lets Make a Mammoth  Lets Make a Mammoth
Lets Make a Mammoth
 
Clase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdfClase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdf
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
 
Overview on arabidopsis and rice genome
Overview on arabidopsis and rice genomeOverview on arabidopsis and rice genome
Overview on arabidopsis and rice genome
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011
 
Sci cafe humangenome&health
Sci cafe humangenome&healthSci cafe humangenome&health
Sci cafe humangenome&health
 
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
 
Credit seminar on rice genomics crrected
Credit seminar on rice genomics crrectedCredit seminar on rice genomics crrected
Credit seminar on rice genomics crrected
 
Human encodeproject
Human encodeprojectHuman encodeproject
Human encodeproject
 
Lecture 1,2
Lecture 1,2Lecture 1,2
Lecture 1,2
 
Marzillier_09052014.pdf
Marzillier_09052014.pdfMarzillier_09052014.pdf
Marzillier_09052014.pdf
 
Story_of_HGP_As_Told_by_Front-line_Participant.pdf
Story_of_HGP_As_Told_by_Front-line_Participant.pdfStory_of_HGP_As_Told_by_Front-line_Participant.pdf
Story_of_HGP_As_Told_by_Front-line_Participant.pdf
 
Molecularbiology 090516221322-phpapp01
Molecularbiology 090516221322-phpapp01Molecularbiology 090516221322-phpapp01
Molecularbiology 090516221322-phpapp01
 

Mehr von Keith Bradnam

13 questions you might have about galaxy
13 questions you might have about galaxy13 questions you might have about galaxy
13 questions you might have about galaxyKeith Bradnam
 
This bioinformatics lesson is brought to you by the letter 'W'
This bioinformatics lesson is brought to you by the letter 'W'This bioinformatics lesson is brought to you by the letter 'W'
This bioinformatics lesson is brought to you by the letter 'W'Keith Bradnam
 
This bioinformatics lesson is brought to you by the letter 'T'
This bioinformatics lesson is brought to you by the letter 'T'This bioinformatics lesson is brought to you by the letter 'T'
This bioinformatics lesson is brought to you by the letter 'T'Keith Bradnam
 
This bioinformatics lesson is brought to you by the letter 'D'
This bioinformatics lesson is brought to you by the letter 'D'This bioinformatics lesson is brought to you by the letter 'D'
This bioinformatics lesson is brought to you by the letter 'D'Keith Bradnam
 
Thoughts on the feasibility of an Assemblathon 3 contest
Thoughts on the feasibility of an Assemblathon 3 contestThoughts on the feasibility of an Assemblathon 3 contest
Thoughts on the feasibility of an Assemblathon 3 contestKeith Bradnam
 
Genome assembly: then and now (with notes) — v1.2
Genome assembly: then and now (with notes) — v1.2Genome assembly: then and now (with notes) — v1.2
Genome assembly: then and now (with notes) — v1.2Keith Bradnam
 
Genome assembly: then and now — v1.2
Genome assembly: then and now — v1.2Genome assembly: then and now — v1.2
Genome assembly: then and now — v1.2Keith Bradnam
 
Genome assembly: then and now — v1.1
Genome assembly: then and now — v1.1Genome assembly: then and now — v1.1
Genome assembly: then and now — v1.1Keith Bradnam
 
Genome assembly: then and now — with notes — v1.1
Genome assembly: then and now — with notes — v1.1Genome assembly: then and now — with notes — v1.1
Genome assembly: then and now — with notes — v1.1Keith Bradnam
 
What's in a name? Better vocabularies = better bioinformatics?
What's in a name? Better vocabularies = better bioinformatics?What's in a name? Better vocabularies = better bioinformatics?
What's in a name? Better vocabularies = better bioinformatics?Keith Bradnam
 
The art of good science writing
The art of good science writingThe art of good science writing
The art of good science writingKeith Bradnam
 
Genome assembly: then and now — v1.0
Genome assembly: then and now — v1.0Genome assembly: then and now — v1.0
Genome assembly: then and now — v1.0Keith Bradnam
 
Polish that presentation! 25 tips to bring clarity to your slides
Polish that presentation! 25 tips to bring clarity to your slidesPolish that presentation! 25 tips to bring clarity to your slides
Polish that presentation! 25 tips to bring clarity to your slidesKeith Bradnam
 
10 tips for adding polish to presentations
10 tips for adding polish to presentations10 tips for adding polish to presentations
10 tips for adding polish to presentationsKeith Bradnam
 
Database talk for Bits & Bites meeting
Database talk for Bits & Bites meetingDatabase talk for Bits & Bites meeting
Database talk for Bits & Bites meetingKeith Bradnam
 
Benchmarking short-read mapping programs
Benchmarking short-read mapping programsBenchmarking short-read mapping programs
Benchmarking short-read mapping programsKeith Bradnam
 
Thoughts on the recent announcements by Oxford Nanopore Technologies
Thoughts on the recent announcements by Oxford Nanopore TechnologiesThoughts on the recent announcements by Oxford Nanopore Technologies
Thoughts on the recent announcements by Oxford Nanopore TechnologiesKeith Bradnam
 
Twitter 101 - an introduction to Twitter
Twitter 101  - an introduction to TwitterTwitter 101  - an introduction to Twitter
Twitter 101 - an introduction to TwitterKeith Bradnam
 

Mehr von Keith Bradnam (18)

13 questions you might have about galaxy
13 questions you might have about galaxy13 questions you might have about galaxy
13 questions you might have about galaxy
 
This bioinformatics lesson is brought to you by the letter 'W'
When is a genome finished?

  • 1. When is a genome finished? Keith Bradnam. A talk given to the UC Davis Bits & Bites club, based on an earlier lecture I had given at UC Davis. Keith Bradnam, March 2011. These slides and notes are licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.
  • 2. Part 1 - the sequence We can think of ‘genome completion’ as referring to the sequence and/or the set of gene annotations. Let’s start with the sequence.
  • 3. A brief history of genomics. 1971: Wu & Taylor determine the first ever DNA sequence (all 12 bp of it!). 1977: Sanger et al. sequence the first ever (DNA-based) virus genome, 5,375 bp. 1995: first complete bacterial genome sequence (Haemophilus influenzae), 1.83 Mb. 1996: first complete eukaryotic genome (Saccharomyces cerevisiae), 12 Mb. 1998: first animal genome (Caenorhabditis elegans), 100 Mb. It took 18 years after we knew the structure of DNA before anyone could sequence it. The first DNA sequence was from the end of a bacteriophage lambda virus (written up in a 20-page paper). The first genome was actually an RNA viral genome, determined in 1975 by Fiers et al. The 1980s and 1990s saw the start of widespread DNA sequencing for genes of interest in species of interest. Moving to eukaryotic genome sequencing means determining multiple chromosomes and tackling bigger repeats (more assembly problems).
  • 4. genomesonline.org [Bar chart: number of genome projects, Complete vs. Incomplete, split into Bacteria, Archaea and Eukaryotes; labeled values 3,077 and 7,732.] Genomesonline.org tries to track all of the major genome projects out there. A lot of them are flagged as incomplete, and maybe some of those will never reach ‘completion’ status.
  • 5. CAP criteria: 1) Complete 2) Accessible 3) Permanent (Sydney Brenner). The great biologist may have won a Nobel prize for his work on development, he may have postulated the very existence of mRNA, and he may have co-discovered the triplet code ... but he also came up with the CAP criteria. These criteria could pertain to any large-scale academic project, but they were conceived with reference to genome sequencing projects.
  • 6. Homo sapiens. 2000: ‘working draft’ announced. 2001: ‘working draft’ published. 2003: ‘finished’ version announced. 2006: last chromosome finished. So it’s finished now, right? Ns make up ~9% of the current genome. The human genome has been finished on several different dates, depending on how you define ‘finished’. Ns (unknown bases) still account for 9% of the 3.1 Gbp genome.
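The "Ns make up ~9%" style of figure quoted on this slide can be measured directly from a downloaded sequence. The following is only a minimal sketch of one way to do that, not the method used for the slides: the helper names `read_fasta` and `n_fraction` and the toy record are hypothetical, and it assumes plain FASTA input in which unknown bases are written as `N` or `n`.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n_fraction(sequence):
    """Return the fraction of bases in `sequence` that are N (unknown)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


# Toy record (made up for illustration): 6 Ns in 20 bases.
toy = [">chr_toy", "ACGTNNNNAC", "GTACGTNNAC"]
for name, seq in read_fasta(toy):
    print(name, f"{n_fraction(seq):.1%}")
```

For a real genome the same functions could be fed an open file handle instead of the toy list. At roughly 9% of 3.1 Gbp, the human figure corresponds to something on the order of 280 Mbp of unknown bases.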
  • 7. Drosophila melanogaster. 2000: genome published. ~175 Mb genome. So it’s finished now, right? Ns make up ~4% of the current genome. Drosophila has a much smaller genome, but a third of it is represented by the harder-to-sequence heterochromatin. This was the subject of a separate genome project that didn’t finish until 2007. The genome still has many Ns.
  • 8. Arabidopsis thaliana. Published 2000: 115 Mb sequenced, 125 Mb genome. As of 2007: 119 Mb sequenced, 157 Mb genome. As of 2012: 119 Mb sequenced, 135 Mb genome. Ns make up ~0.2% of the current genome. Published genome sizes are often based on estimates, which can be wrong. As more and more of the Arabidopsis genome was sequenced, the estimate of its size had to be revised. So between 2000 and 2007 more sequence was produced but, paradoxically, the genome became less complete. This illustrates the difficulty of estimating genome size. The latest figures suggest that the genome is smaller again. Note that much of this missing genome is not present as Ns in the sequence you download. But the part you can download still has many unknown bases.
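The paradox on this slide (more sequence, yet a less complete genome) is easy to see once the figures are turned into percentages. Here is a small worked sketch using only the numbers from the slide; the function name `percent_sequenced` is just an illustrative choice.

```python
def percent_sequenced(sequenced_mb, genome_mb):
    """Percentage of the estimated genome that has been sequenced."""
    return 100 * sequenced_mb / genome_mb


# (year, Mb sequenced, estimated genome size in Mb), as given on the slide.
arabidopsis = [
    ("2000", 115, 125),
    ("2007", 119, 157),
    ("2012", 119, 135),
]

for year, sequenced, estimated in arabidopsis:
    pct = percent_sequenced(sequenced, estimated)
    print(f"{year}: {sequenced} of {estimated} Mb = {pct:.1f}% sequenced")
```

With these inputs the apparent completeness drops from 92.0% in 2000 to about 75.8% in 2007, then recovers to about 88.1% in 2012, even though the amount of sequence never went down. The denominator is an estimate, so the percentage is only as good as that estimate.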
  • 9. Caenorhabditis elegans. 1998: ‘finished’ genome published. 100 Mb genome (revised from 97 Mb). 2002: last gap closed. Genome information for species such as C. elegans is curated by model organism databases (MODs), which ensure that the work goes on long after the initial publication announcing a ‘finished’ genome. The genome size was revised from 97 Mb to 100 Mb not long after publication.
  • 10. Where’s my gene??? People will often know from traditional genetic experiments that their gene of interest is definitely present in a genome... however, it might not be present in the published genome sequence. The figure shows when sections at one end of chromosome X of C. elegans were finished (labeled 1997, 2000, 2001 and 2002). The last 20 kbp region wasn’t finished until four years after the genome was published in 1998. This region contained predicted genes... maybe scientists were working on these genes while waiting for the sequence.
  • 11. Caenorhabditis elegans. 1998: ‘finished’ genome published. 100 Mb genome (revised from 97 Mb). 2002: last gap closed. 2004: last N removed. So it’s finished now, right? Unlike the previous genomes, C. elegans has no Ns (but this took 6 years after publication to achieve).
  • 12. Worm genome progress. [Chart: genome size (bp), 0 to 100,000,000, plotted against date, Jan-91 to May-06.] At a gross level, it looks like the worm genome did not change much after the year 2000...
  • 13. Worm genome progress. [Chart: genome size (bp), 100,220,000 to 100,280,000, plotted against date, Sep-01 to Oct-05; annotation: 66 nt added, May 2010.] Here is a zoom in on the years 2001–2005... still lots of sequence changes happening. The last change on this graph represents a very small addition of 66 bp to the genome. Maybe this change will not make any difference to anyone in the world, but it still makes the genome sequence more accurate and closer to the biological truth. Not many genome projects are this devoted!
  • 14. Saccharomyces cerevisiae. Published 1997. 12 Mb genome. No gaps, no Ns. So it’s finished now, right? 1,653 genome changes made since 1997. Last change made in February 2011. Like C. elegans, yeast is a species that benefits from coordinated efforts to finish the genome. In February 2011, the yeast genome sequence underwent corrections that affected 194 proteins. This happened in a genome that is tiny by today’s standards and that has been studied and curated for 15 years! What hope for larger, more complex genomes?
  • 15. Part 2 - annotations Maybe you don’t care about the state of the genome, as long as you have all of the genes present.
  • 16. C. elegans annotations. [Chart: numbers of Genes and Proteins, 19,000 to 25,000, by year, from genome publication in 1998 and then 2003 to 2011.] Since publication, the number of protein-coding loci in C. elegans has risen by about 1,500 genes. But the number of proteins that might arise from alternatively spliced products is much, much higher and shows no signs of slowing down.
  • 17. C. elegans annotations. [Chart: numbers of Genes, Proteins and RNA genes, 0 to 25,000, by year, from genome publication in 1998 and then 2003 to 2011.] When we consider RNA genes, it is surprising that there are now more RNA genes than protein-coding genes. How many more species have similar secrets in their genomes that have yet to be discovered, mostly because of our historical focus on protein-coding genes?
  • 18. Core genes. You can identify ‘core’ genes that are highly conserved and that should be present in all species. Our group identified a set of 458 core genes from 6 reference genomes: Homo sapiens, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Saccharomyces cerevisiae and Schizosaccharomyces pombe. We can then test whether these are all present in any ‘finished’ genome. Our lab developed a set of 458 ‘core genes’ that we believe should be present in every (complete) eukaryotic genome. In the past we’ve discovered that many published genomes are missing some of these genes from the genome sequence, even though they should be there. E.g. chicken has missing core genes even though those genes are represented by chicken EST sequences.
  • 19. Ciona intestinalis. Version v1.95: N50 234,500, core genes 444. Version v2.0: N50 2,571,800, core genes 425. Sometimes genomes get updates and assemblies are given a new version number. This might be associated with an increase in average scaffold size, but sometimes the number of core genes gets reduced.
  • 20. Caenorhabditis sp. PS1010. Version v4: N50 9,446, core genes 454. Version v5: N50 64,074, core genes 428. It is easy to measure things like N50, but harder to measure which genes are present (though people can use our free CEGMA tool!).
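Since N50 is the metric being contrasted with core gene counts on these two slides, a short sketch of how it is commonly computed may help. This uses the usual definition (the length of the sequence at which the running total of lengths, sorted longest first, reaches at least half of the assembly) and is not necessarily the exact procedure behind the numbers above. The function names are illustrative, and the only values taken from the slides are the 458-gene core set and the per-version figures.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def core_gene_percent(found, total=458):
    """Percentage of the 458 core genes found in an assembly."""
    return 100 * found / total


# Toy scaffold lengths, made up for illustration.
print(n50([80, 70, 50, 40, 30, 20]))

# (assembly, N50, core genes found), as listed on slides 19 and 20.
assemblies = [
    ("Ciona intestinalis v1.95", 234_500, 444),
    ("Ciona intestinalis v2.0", 2_571_800, 425),
    ("Caenorhabditis sp. PS1010 v4", 9_446, 454),
    ("Caenorhabditis sp. PS1010 v5", 64_074, 428),
]
for name, assembly_n50, found in assemblies:
    print(f"{name}: N50 {assembly_n50:,}, {core_gene_percent(found):.1f}% of core genes")
```

Printing the two measures side by side shows the pattern the slides point out: in both species the newer version has a much larger N50 but a smaller share of the core genes.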
  • 21. S. cerevisiae. Genome sequence changes in Feb 2011 altered 194 protein sequences. The last correction to a gene structure due to mis-annotation was in Jan 2010. So just 13 years to produce a stable gene set! Even in a simpler genome, the work of annotation goes on. Bear in mind that many model organism databases split genes into different categories based on evidence.
  • 23. ‘Finished’ eukaryotic genome sequences are not finished! (Except maybe yeast.) Not that this necessarily matters. 1% of a genome is better than no genome at all. At some level, the law of diminishing returns sets in. Ideally, we could produce a metric of ‘useful papers published per person-hour of database curator working on model organism database’. Just be aware that the genome you download today may change in future, and your results might not always be easily reproducible by someone using a different version.
  • 24. CAP criteria: 1) Complete 2) Accessible 3) Permanent (Sydney Brenner). Clearly they are not all complete. As for accessibility, it is not always easy to get hold of large datasets. Bandwidth represents a particular problem (it can be almost impossible to download GenBank from east coast to west coast using FTP). Also, online journals often end up breaking links to supplemental material. For the most part, they are permanent. But not always the raw, unassembled read data.