Yeast genome project

Introduction
• Saccharomyces cerevisiae
• It is perhaps the most useful diploid yeast, having been instrumental in winemaking, baking and
brewing since ancient times
• It is one of the most intensively studied eukaryotic model organisms in molecular and cell biology
• Size: 5-10 µm in diameter
• Sequenced in year: 1996
• Strain sequenced: S288C
• Databases:
▪ Munich Information Centre for Protein Sequences (MIPS): http://www.mips.biochem.mpg.de/mips/yeast/
▪ Yeast Protein Database (YPD): http://quest7.proteome.com/YPDhome.html
▪ Saccharomyces Genome Database (SGD): http://genome-www.stanford.edu/Saccharomyces/

• Schizosaccharomyces pombe (fission yeast)
• It is used as a model organism in molecular and cell biology
• Size: 3 to 4 µm in diameter and 7 to 14 µm in length
• Sequenced in year: 2002
• Strain sequenced: 972h, by the European sequencing consortium (EUPOM), including 13 laboratories
and the Wellcome Trust Sanger Institute, and Cold Spring Harbor Laboratory
• Databases:
▪ PomBase: http://www.pombase.org
▪ Broad Institute Schizosaccharomyces group database:
http://www.broadinstitute.org/annotation/genome/schizosaccharomyces_group/MultiHome.html

• Candida albicans
• Most common human fungal pathogen
• It is a diploid fungus that grows both as yeast and as filamentous cells and is a
causal agent of opportunistic oral and genital infections in humans
and of candidal onychomycosis, an infection of the nail plate
• Size: 2.0-7.0 µm in diameter and 3.0-8.5 µm in length
• Sequenced in year: 2004, by a consortium formed by the Stanford Genome
Technology Center
• Strain sequenced: SC5314
• Databases:
▪ Candida database: http://www.candidagenome.org
▪ Broad Institute Candida group database:
http://www.broadinstitute.org/annotation/genome/candida_group/MultiHome.html

• The baker's yeast Saccharomyces cerevisiae was the first eukaryote whose genome was
entirely sequenced
• Mitochondrial DNA was sequenced in segments in the 1980s
• In 1989, it was decided to initiate a yeast sequencing project within the framework of
the EU biotechnology programmes; some 35 European laboratories were initially
involved in this enterprise [Vassarotti & Goffeau, 1992]
• Chromosome III was the first chromosome to be completed, in 1992, followed by XI
and II, both in 1994
• The publication of the 315 kb sequence of yeast chromosome III was a remarkable
scientific landmark, not only because it was the first eukaryotic chromosome ever to be
sequenced, but primarily because it revealed the extent of what remained to
be understood in the genome of an otherwise extensively studied
organism such as Saccharomyces cerevisiae
• Soon after the project began, several other laboratories joined it and agreed
upon an international collaboration that enabled the whole yeast genome
sequence to be finalized in 1995
• More than 600 scientists in Europe, North America and Japan became involved in
this effort, and the entire sequence was released in April 1996.

EU=55.9%, UK=17.6%, USA= 20.0%, Canada= 4.3%, Japan= 2.2%
Figure: Consortia involved in the yeast genome sequencing project

Cloning and Mapping Procedures:
• The sequencing of chromosome III started from a collection of
overlapping plasmid or phage lambda clones that were distributed by the
DNA coordinator to the contracting laboratories. However, it soon became
evident that ordered cosmid libraries were much more advantageous to
aid large scale sequencing.
• To construct a library with as complete coverage as possible with as few
clones as possible, the cloned DNA fragments should be randomly
distributed on the DNA.
• Under these conditions, the number of clones (N) in a library representing
each genomic segment with a given probability (P) is
N = ln (1-P)/ln (1-f)
where f is the insert length expressed as fraction of the genome size
[Clarke & Carbon, 1976].

• For example, with the size of 12,800 kb for the yeast genome and
assuming an average insert length of 35 kb, a cosmid library containing
4600 random clones would represent the yeast genome at P=99.99%, i.e.
about twelve times the genome equivalent
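
The Clarke & Carbon relation above can be evaluated directly. The following is a minimal sketch in Python, using only the genome size (12,800 kb), insert length (35 kb) and clone count (4600) quoted on the slide; the function names are illustrative and not part of any published tool.

```python
import math


def clones_needed(p, insert_kb, genome_kb):
    """Clones N needed so that any given segment is present with probability p.

    Implements N = ln(1 - P) / ln(1 - f), with f = insert length / genome size.
    """
    f = insert_kb / genome_kb
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - f))


def coverage_probability(n_clones, insert_kb, genome_kb):
    """Probability P that a given segment is represented among n_clones random clones."""
    f = insert_kb / genome_kb
    return 1.0 - (1.0 - f) ** n_clones


def genome_equivalents(n_clones, insert_kb, genome_kb):
    """Total cloned DNA expressed as multiples of the genome size."""
    return n_clones * insert_kb / genome_kb


if __name__ == "__main__":
    genome_kb, insert_kb = 12_800, 35  # values quoted on the slide
    print("clones for P = 99.99%:", clones_needed(0.9999, insert_kb, genome_kb))
    print("P with 4600 clones:", coverage_probability(4600, insert_kb, genome_kb))
    print("genome equivalents:", genome_equivalents(4600, insert_kb, genome_kb))
```

Under these inputs, 4600 clones correspond to roughly 12.6 genome equivalents and a coverage probability above 99.99%, which is consistent with the figures quoted above.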

• A low number of clones was of interest in setting up ordered yeast cosmid
libraries or specific sublibraries, which were sorted out from an unordered cosmid library
by colony hybridization using specific chromosomal DNA purified by pulsed-field
gel electrophoresis as a probe
• The 'nested chromosomal fragmentation' method [Thierry & Dujon, 1992] was
then applied to rapid sorting of these clones
• Finally, a set of overlapping cosmids was sufficient to build a contig of a specific
chromosome
• This approach has also been successfully applied to many of the other
chromosomes sequenced in the yeast genome project
• To facilitate sequencing and assembly of the sequences, contigs of
overlapping cosmids and fine-resolution physical maps of the respective
chromosomes were constructed first, by application of classical mapping
methods (fingerprints, cross-hybridization) or by novel methods developed
for this programme, such as site-specific chromosome fragmentation [Thierry
& Dujon, 1992] or the high-resolution cross-hybridization matrix [Scholler et
al., 1995]

Sequencing strategies and Sequence Assembly
• In the European network, clones were distributed to the collaborating laboratories according to a
scheme worked out by the DNA coordinators
• Each contracting laboratory was free to apply sequencing strategies and techniques of its own,
provided that the sequences were entirely determined on both strands and unambiguous
readings were obtained
• Two principal approaches were used to prepare subclones for sequencing:
1) generation of sub-libraries by the use of a series of appropriate restriction enzymes or from
nested deletions of appropriate sub-fragments made by exonuclease III
2) generation of shotgun libraries from whole cosmids or subcloned fragments by random
shearing of the DNA
• Sequencing by the Sanger technique was done
1) manually, labelling with [35S]dATP being the preferred method of monitoring
2) by automated devices
• Two types of devices for on-line detection with fluorescence labelling were employed
1) Applied Biosystems ABI373A
2) Pharmacia A.L.F.
• One laboratory used the direct blotting electrophoresis system from the GATC company (Konstanz)
• Similar procedures were applied to the sequencing of chromosomes outside the European
network. The American laboratories largely relied on machine-based large-scale sequencing.

Sequencing Telomeres
• The yeast chromosome telomeres presented a particular
problem

• Due to their repetitive sub-structures and the lack of
appropriate restriction sites, they could not be cloned by
conventional procedures, with only a few exceptions
• Largely, telomeres were physically mapped relative to the
terminal-most cosmid inserts using the I-SceI chromosome
fragmentation procedure [Thierry & Dujon, 1992]
• The sequences were then determined from specific plasmid
clones obtained by 'telomere trap cloning', an elegant
strategy developed by E. Louis at Oxford [Louis, 1994; Louis
& Borts, 1995]

Sequence Assembly
• Within the European network, all original sequences were submitted by the collaborating
laboratories to the Martinsried Institute of Protein Sequences (MIPS), which acted as an
informatics centre
• The sequences were kept in a data library, assembled into progressively growing contigs, and
updated during the course of the project by the application of appropriate criteria in a number of
quality controls, starting with chromosome XI
• In collaboration with the DNA coordinators, the final chromosome sequences were derived. For
the other yeast chromosomes, too, automated procedures were employed for sequence
assembly, based for example on
1) the program package developed at Cambridge [e.g. Dear & Staden, 1991]
2) the ACeDB program developed for the C. elegans genome project [Thierry-Mieg & Durbin, 1992]
• In any case, correct assembly of the sequences was guaranteed by establishing that the order of
restriction sites predicted from the sequence was consistent with the physical maps of these sites
that had been determined independently, and care was taken to perform quality controls that
would result in high accuracy
• From theoretical considerations taking all types of errors together, it follows that with an average
sequence accuracy of 99.9%
• In practice, care was taken to minimize frameshift errors, which represented about two thirds of
all sequencing errors and thus would have the most deleterious effects on gene interpretation.
Since then, all sequences have been systematically rechecked for errors and corrected in
the data libraries.
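
To give a feel for what a 99.9% per-base accuracy would mean at the level of a single gene, the arithmetic can be sketched as follows. This is only an illustration under strong assumptions that are not stated on the slides: errors are treated as independent and uniformly distributed, the ORF length of 1450 bp is the average ORF size quoted elsewhere in this deck, and the two-thirds frameshift share is taken from the bullet above. The function names are hypothetical.

```python
def expected_errors(length_bp, accuracy=0.999):
    """Expected number of erroneous bases in a sequence of length_bp."""
    return length_bp * (1.0 - accuracy)


def error_free_fraction(length_bp, accuracy=0.999):
    """Probability that a sequence of length_bp contains no error at all,
    assuming independent errors at every position (an assumption)."""
    return accuracy ** length_bp


if __name__ == "__main__":
    orf_bp = 1450            # average ORF size quoted in this deck
    frameshift_share = 2 / 3  # share of frameshifts among all errors, as quoted
    errors = expected_errors(orf_bp)
    print(f"expected errors per average ORF: {errors:.2f}")
    print(f"of which frameshifts: {errors * frameshift_share:.2f}")
    print(f"fraction of error-free ORFs: {error_free_fraction(orf_bp):.2f}")
```

Under these assumptions an average-sized ORF would carry between one and two errors, which suggests why the additional checks against frameshifts described above mattered for gene interpretation.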

The sequences have been interpreted using the following principles:
i. All intron splice site/branch-point pairs detected by using
specially defined patterns were listed
ii. All ORFs containing at least 100 contiguous sense codons and not
contained entirely in a longer ORF were listed
iii. Centromere and telomere regions, as well as tRNA genes and Ty
elements or remnants thereof, were sought by comparison with
previously characterized datasets
• Similarity searches used FASTA, BLASTX and FLASH1 in combination with the
Protein Sequence Database of PIR-International and other public
databases
• Protein signatures were detected by using the PROSITE
dictionary, as well as BLOCKS and PRODOM domains
• Analyses of base composition, nucleotide pattern frequencies, GC profiles and ORF
distribution profiles were performed by using GCG programs or the
X11 program package
• For calculations of the GC content of ORFs, the algorithm CODONS was
used
• This information was compiled at the end of the sequencing project
to annotate all genetic elements in the yeast genome
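
Rule ii above (at least 100 contiguous sense codons, not contained entirely in a longer ORF) can be illustrated with a small sketch. This is not the project's actual software: the requirement that an ORF starts with ATG, the six-frame scan and the coordinate-based containment test are assumptions made here for illustration, and intron handling (rule i) is ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_on_strand(seq, min_codons):
    """Yield (start, end) of ATG..stop ORFs in the three frames of one strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    min_codons counts sense codons (the ATG included, the stop excluded).
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i  # first ATG after the previous stop
            elif codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None


def find_orfs(seq, min_codons=100):
    """Return sorted (start, end, strand) ORFs, dropping any ORF whose
    coordinates lie entirely inside a longer ORF."""
    seq = seq.upper()
    n = len(seq)
    found = [(s, e, "+") for s, e in orfs_on_strand(seq, min_codons)]
    # Map minus-strand coordinates back onto the forward strand.
    found += [(n - e, n - s, "-")
              for s, e in orfs_on_strand(reverse_complement(seq), min_codons)]
    kept = []
    for orf in found:
        contained = any(
            other[0] <= orf[0] and orf[1] <= other[1]
            and (other[1] - other[0]) > (orf[1] - orf[0])
            for other in found
        )
        if not contained:
            kept.append(orf)
    return sorted(kept)


if __name__ == "__main__":
    toy = "ATG" + "GCT" * 5 + "TAA"  # toy sequence: six sense codons plus a stop
    print(find_orfs(toy, min_codons=3))
```

With the default threshold of 100 codons the toy sequence yields no ORF; lowering `min_codons` is only meant for trying the function on short test strings.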

Classification of S. cerevisiae genes

ORF sizes in the S. cerevisiae genome

➢ At the time the yeast genome sequencing project was
finalized, comparison of the total sequence with public databases
revealed:
• Some 28.4% of the yeast ORFs corresponded either to previously
known protein-encoding genes or to genes whose functions had
been determined previously or during the course of the project
• An estimated 5.6% of the total remained questionable ORFs
• 66% of the total ORFs represented novel putative yeast genes
• 14.8% of the total had homologues among gene products from
yeast or other organisms whose functions are known
• 14.4% of the total had recognizable motifs or weak homologies to
genes of experimentally characterized functions
• The remaining 37.7% of the total ORFs either had homologues to ORFs
of unknown function on other chromosomes or had no homologues at all
• Thus, approximately 2200 of the yeast genes had to be categorized
as 'genes of unknown function', sometimes called 'orphans'
➢ A most useful inventory of the yeast proteins has been compiled in
the Yeast Proteome Database (YPD) [Garrels et al., 1996] and is
updated regularly.

The mystery of orphans
• 'Orphans' are defined by the absence of known function and of structural homologs of known
function, so it seems only natural that, with time, they will vanish.
• Functions of a few genes previously classified as orphans were reported during the sequencing
project itself
• The most striking result from the chromosome III sequence was that approximately half of
all protein-coding ORFs revealed by the sequence had no clear-cut sequence homologs in any
organism, including yeast itself
• Thus, with the sequence of the first eukaryotic chromosome, it was the discovery of the
extent of our ignorance, rather than the discovery of many new genes, that was the most
conspicuous finding
• Exact figures depend on the stringency criteria applied to determine the significance of sequence
similarities
• On average, 30-35% of all ORFs of the yeast genome are orphans.
• Even in the absence of homologs, computers can provide some clues about the nature of some
orphans.
• For example, prediction of transmembrane segments resulted in the striking conclusion that up
to 35-40% of the predicted proteins from chromosome III have transmembrane helices.
• Ultimately, the function of each sequence-predicted ORF can only be demonstrated by
experiments
• The total number of orphans in the yeast genome is about 2000
• It is clear that orphans, by and large, are not fundamentally different from other yeast genes in
terms of expression.
• If orphans are real genes, why were they not discovered before?
o Genome redundancy is a possible explanation. As sequencing progressed, structural homologs to
earlier orphans were regularly discovered in the yeast genome. Statistically, however, there is no
indication that orphans tend to be more frequently duplicated than the genes previously
characterized by classical genetics or their structural homologs. If anything, the converse seems
to be true.

Gene Density and Gene Arrangement of Protein-encoding Genes in S. cerevisiae
• From the number of genes and the total size of the yeast genome one arrives at a gene density
• Gene density in all yeast chromosomes is rather similar
• Excluding the ORFs contributed by the Ty elements, ORFs occupy on average 70% of the sequences.
This leaves only limited space for the intergenic regions, which can be thought to harbour the
major regulatory elements involved in chromosome maintenance, DNA replication and
transcription.
• The compact nature of the S. cerevisiae genome is apparent when compared to more complex
eukaryotic systems.
• C. elegans contains a potential protein-encoding gene only every 5-6 kb [Hodgkin et al., 1994]
• In the human genome, gene density had been estimated to be as low as one gene in 30 kb
[Olson, 1993]; now that the draft sequence is available, this figure is about one gene in 100 kb
• Schizosaccharomyces pombe possesses a lower gene density (one gene per 2.3 kb) than S.
cerevisiae. The difference between the two yeast genomes appears to be due to the fact that in the
fission yeast 40% of the genes contain introns, whereas only a minor fraction (< 5%) of the
protein-encoding genes in S. cerevisiae are found to be interrupted by introns
• Generally, ORFs appear to be rather evenly distributed between the two strands of the individual
chromosomes. In some chromosomes (e.g. I, II, VIII), there is a slight excess of coding capacity on
one of the strands, the significance of which is not known
• Average base composition of yeast DNA is 38.4% (G+C)
• GC content of:
1. protein-coding regions (40.2%)
2. non-coding regions (35.1%)
• Coding regions are evenly distributed between the two strands
• Average ORF size is 1450 bp
• The average sizes of inter-ORF regions vary between 630 and 945 bp for different chromosomes
1. 618 bp on average for 'divergent promoters' (36.2% GC)
2. 326 bp for 'convergent terminators' (29.3% GC)
3. 517 bp for 'promoter-terminator combinations' (34.2% GC)
• Average base composition has been found to be symmetrical over the entire chromosomes
• The base composition of the ORFs themselves shows a significant excess of homopurine pairs on the
coding strand.
• Regional variations of base composition with similar amplitudes were first noted along
chromosome III
• A most interesting observation was that the compositional periodicity correlates with local gene
density, which reaches more than 85% in GC-rich regions, followed by segments of comparably lower
gene density (50-55%) in AT-rich regions [Dujon et al., 1994].
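
The GC figures and the regional variation in base composition described above are typically obtained with a simple sliding-window calculation. The following is a minimal sketch, not the GCG or CODONS programs named in this deck; the window and step sizes are arbitrary assumptions and the function names are illustrative.

```python
def gc_content(seq):
    """Return the G+C percentage of a DNA string, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return 100.0 * gc / acgt if acgt else 0.0


def gc_profile(seq, window=1000, step=500):
    """Return (window_start, GC%) pairs for windows sliding along seq.

    A sequence shorter than one window is reported as a single window.
    """
    last_start = max(len(seq) - window, 0)
    return [(start, gc_content(seq[start:start + window]))
            for start in range(0, last_start + 1, step)]


if __name__ == "__main__":
    toy = "GGGGAAAA"  # toy sequence: a GC-rich half followed by an AT-rich half
    for start, gc in gc_profile(toy, window=4, step=4):
        print(start, gc)
```

Plotting such a profile next to the fraction of each window covered by ORFs would be one way to look for the correlation between GC-rich regions and gene density reported by Dujon et al.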

➢ Functional elements of yeast chromosomes:
1. Centromere
2. Telomere
3. Origins of replication

➢ Complex and simple repeats
• The yeast genome is remarkably poor in repeated sequences
• A unique constellation of repetitious sequences is found at the two ends of chromosome I.
Approximately 30 kb in each subtelomeric region carries similar (but non-essential) genes and a
15 kb repeat
• These terminal regions represent the yeast equivalent of heterochromatin, and the
occurrence of this type of DNA suggests that its presence gives this chromosome the
critical length required for proper stability and function
• The 30 kb region can be removed from each end without affecting vegetative
growth, although chromosome stability is considerably reduced
• Besides the Ty elements, it is the rDNA on chromosome XII that contributes most significantly
to repetitiveness. A cluster of some 15 tandem repeats (2 kb each)
containing the CUP1 gene and contributing to polymorphic variation is found on
chromosome VIII
• Repeated stretches of short oligonucleotides exist. These include poly(A) or poly(T)
tracts, alternating poly(AT) or poly(TG) tracts, and direct or inverted long repeats
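
Simple tracts such as poly(A), poly(T), poly(AT) or poly(TG) can be located with regular expressions. The sketch below is only an illustration: the 10 bp minimum length is an arbitrary assumption, longer direct or inverted repeats are not covered, and the function name is hypothetical.

```python
import re

# One base repeated, e.g. AAAAAAAAAA.
HOMOPOLYMER = re.compile(r"([ACGT])\1+")
# Two different bases alternating, e.g. TGTGTGTGTG.
DINUCLEOTIDE = re.compile(r"(([ACGT])(?!\2)[ACGT])\1+")


def find_tracts(seq, min_length=10):
    """Return sorted (start, end, unit) tuples for homopolymer and
    dinucleotide tracts of at least min_length bp (0-based, end-exclusive).

    The reported unit depends on where the leftmost match begins, so a
    poly(TG) tract preceded by a G may be reported with unit "GT".
    """
    seq = seq.upper()
    hits = []
    for pattern in (HOMOPOLYMER, DINUCLEOTIDE):
        for match in pattern.finditer(seq):
            if match.end() - match.start() >= min_length:
                hits.append((match.start(), match.end(), match.group(1)))
    return sorted(hits)


if __name__ == "__main__":
    toy = "CC" + "A" * 12 + "CC" + "TG" * 6 + "AA"  # toy sequence
    print(find_tracts(toy))
```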

Genome Inventory of S. cerevisiae

Graphical View of Protein Coding Genes of S. cerevisiae (as
of Nov 20, 2013)

Distribution of Gene Products among Biological
Process Categories

S. cerevisiae gene products that are annotated to one or more terms in each GO aspect

Distribution of Gene Products among
Molecular Function Categories

Distribution of Gene Products among Cellular
Component Categories

Genome Inventory of S. pombe (2004 and 2013)

Genome Inventory of C. albicans

Graphical View of Protein Coding Genes of C. albicans (as of
Nov 20, 2013)

Distribution of Gene Products among Cellular
Component Categories

C. albicans gene products that are annotated to one or more terms in each GO aspect

Distribution of Gene Products among Biological
Process Categories

Distribution of Gene Products among
Molecular Function Categories

Feature type (total)          Saccharomyces cerevisiae   Schizosaccharomyces pombe   Candida albicans
No. of genes                  6,607                      5,123                       6,214
Chromosome length (bp)        12,157,105                 12,362,167                  14,324,315
Nuclear genome (bp)           12,071,326                 12,342,737                  14,283,895
Mitochondrial genome (bp)     85,779                     19,430                      40,420
No. of chromosomes            16                         3                           8
Mean coding length (bp)       1485                       1426                        1439
No. of introns                272                        4730                        224
Coding percentage             69.9 %                     57.5 %                      61.5 %
Non-coding RNA                92                         450                         -
GC content                    39 %                       36 %                        33.46 %
Gene density (bp per gene)    2124                       2528                        2342
Unique proteins               1104                       681                         1218
Pseudogenes                   19                         29                          7
Centromere                    16                         3                           8
tRNA                          299                        171                         156
rRNA                          27                         47                          6
snRNA                         6                          7                           5
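
The figures in the comparison table can be cross-checked against each other. The sketch below copies the table values and tests two relations: that the nuclear and mitochondrial genome sizes add up to the total length, and that the coding percentage is roughly the mean coding length divided by the gene density. Reading the gene-density row as base pairs per gene is an assumption made here, and the dictionary keys are illustrative names.

```python
# Values copied from the comparison table.
GENOMES = {
    "S. cerevisiae": {"total_bp": 12_157_105, "nuclear_bp": 12_071_326,
                      "mito_bp": 85_779, "mean_coding_bp": 1485,
                      "bp_per_gene": 2124, "coding_pct": 69.9},
    "S. pombe": {"total_bp": 12_362_167, "nuclear_bp": 12_342_737,
                 "mito_bp": 19_430, "mean_coding_bp": 1426,
                 "bp_per_gene": 2528, "coding_pct": 57.5},
    "C. albicans": {"total_bp": 14_324_315, "nuclear_bp": 14_283_895,
                    "mito_bp": 40_420, "mean_coding_bp": 1439,
                    "bp_per_gene": 2342, "coding_pct": 61.5},
}


def totals_consistent(genome):
    """True if nuclear + mitochondrial length equals the total length."""
    return genome["nuclear_bp"] + genome["mito_bp"] == genome["total_bp"]


def estimated_coding_pct(genome):
    """Coding percentage estimated as mean coding length / bp per gene."""
    return 100.0 * genome["mean_coding_bp"] / genome["bp_per_gene"]


if __name__ == "__main__":
    for name, genome in GENOMES.items():
        print(f"{name}: totals consistent = {totals_consistent(genome)}, "
              f"estimated coding = {estimated_coding_pct(genome):.1f}% "
              f"(table: {genome['coding_pct']}%)")
```

For S. cerevisiae and C. albicans the estimate lands within about 0.1 percentage points of the tabulated coding percentage; for S. pombe it comes out roughly one point lower, which may reflect a different definition in the underlying data.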

Table 1: Frequency and Characteristics of Short Tandem
Repeats in the Coding Sequences of Fungal Genomes

Table 2: Number, Abundance Ranking, and Proportion of Gene
Products Containing the Indicated InterPro Protein Domain in yeast
species and human

Genetic and Physical maps
• The genetic map of S. cerevisiae [Mortimer et al., 1992] has been
of considerable value to yeast molecular biologists
• DNA probes from some known genes were mapped to particular
chromosomes for chromosomal walking. Ultimately, however, physical
maps of all chromosomes were constructed without reference
to the genetic maps.
• Apart from local expansion or contraction of the genetic map, and the
fact that the overall frequency of meiotic recombination increases
with decreasing chromosome size, the order of the genes
positioned on the chromosomes by genetic and physical mapping
largely agrees
• Thus, the comparison of the physical and genetic maps shows that
most of the linkages have been established to give the correct
gene order, but that in many cases the relative distances derived
from genetic mapping are imprecise. The obvious imprecision of
the genetic maps may be due to the fact that different yeast
strains have been used in establishing the linkages

Genetic and Physical map of yeast chromosome II

Genetic redundancy in yeast
• There is a considerable degree of internal genetic redundancy in the yeast genome
• It is difficult to correlate physical redundancy completely with functional redundancy because, even
in yeast, gene functions have been precisely defined only to a limited extent
• Duplicated sequences are confined to nearly the entire coding region of these genes and do not
extend into the intergenic regions
• Corresponding gene products share high similarity in terms of amino acid sequence, or sometimes
are even identical, and therefore may be functionally redundant
• Due to sequence differences within the promoter regions, gene expression should vary according
to the nature of the regulatory elements or other (regulatory) constraints; it may well be that one
gene copy is highly expressed while another one is weakly expressed; turning on or off expression of
a particular copy within a gene family may depend on the differentiated status of the cell (such as
mating type, sporulation, etc.)
• Classical examples of redundant genes in subtelomeric regions are the yeast MEL, SUC, MGL and
MAL genes. Subtelomeric regions of several yeast chromosomes share highly conserved
segments, in some instances up to 30 kb, which carry duplicated genes the functions of which are
largely unknown.
• Duplicated genes have also been found in clusters, e.g. on chromosome II, and as a cluster of three
hexose transporter genes on chromosome VIII
• Cluster Homology Regions (CHRs): Comparing the sequences of complete chromosomes to
each other revealed that there are large chromosome segments in which homologous genes are
arranged in the same order, with the same relative transcriptional orientations, on two or more
chromosomes. This is responsible for 30-40% of total redundancy
• Chromosomes II and IV share the longest CHR, comprising a pair of pericentric regions of 170 and
120 kb, respectively, that share 18 pairs of homologous genes
• Significance: Whatever the relative timescale and mechanisms of duplications, these events,
followed by mutations affecting functional properties, give a chance of improved
environmental fitness. On the other hand, the high gene density in yeast indicates a strong
tendency to maintain a compact genome; therefore, compensatory mechanisms must exist to
remove non-functional or superfluous gene copies.

Figure: View of 53 clustered gene duplications between the 16 chromosomes of
yeast

Table: Gene duplication in S. pombe and S. cerevisiae using NCBI BlastClust

Sequence Variation among Yeast Strains
• Polymorphisms among different yeast strains are due to the following
factors:
1) variable number of gene copies from repeated gene families
2) individual patterns caused by the presence or absence of
particular Ty elements
3) plasticity of the chromosome ends
4) excisions or inversions of particular gene regions
5) chromosome breakage, which has been found to occur in yeast, resulting
in karyotypes deviating from the 'normal' picture

Yeast Mitochondrial genome
• The mitochondrial genes and their mosaic intronic structure were first identified in S. cerevisiae in
1998. The first mitochondrial gene ever sequenced was from S. cerevisiae
• The multi-copy mitochondrial genome of S. cerevisiae is characterized by:
▪ low gene density and high A+T content
▪ highly heterogeneous base composition
▪ a G+C content of the genes of approximately 30%
▪ intergenic spacers composed of quasi-pure A+T stretches of several hundreds of base
pairs, interrupted by more than 150 (G+C)-rich clusters, ranging from 10 to 80 bp in length
(This shows why scientists have sequenced the genes and neglected the intergenic regions)
• The genome contains the genes for
▪ cytochrome c oxidase subunits I, II and III (cox1, cox2 and cox3)
▪ ATP synthase subunits 6, 8 and 9 (atp6, atp8 and atp9)
▪ apocytochrome b (cytb) and a ribosomal protein (var1)
▪ several intron-related open reading frames (ORFs)
▪ It also contains 7-8 replication origin-like (ori) elements and encodes 21S and 15S ribosomal RNAs,
24 tRNAs that can recognize all codons, and the 9S RNA component of RNase P
• The cox1 gene and, to a lesser extent, the cytb, 21S RNA and 15S RNA genes constitute the largest
blocks of higher G+C density
• The atp6, atp9, cox2, cox3 and tRNA genes appear as small G+C-enriched islands in the middle of A+T
and G+C cluster-rich regions

Figure legend: Red: exons; Grey: introns; Yellow: rRNA; Green: tRNA; Dark blue: ori elements

Human-Yeast connection
• By comparing the catalogue of human sequences available in the
databases with the ORFs on the completed yeast chromosomes at the
amino acid level it is estimated that:
✓ >30% of the yeast genes have homologues among the human genes.
✓ As expected, most of the genes of known function categorized in this way
represent basic functions in both organisms.
✓ More similarities become apparent when ESTs are included in the
analysis.
✓ The most compelling protagonists among these homologues are yeast genes
that bear substantial similarity to human 'disease genes'
✓ The yeast genome is 200 times smaller than the human one
✓ The yeast genome is only 9-10 times less complex in its capacity to code for
proteins
• Applications:
✓ Yeast may be a simple system to assay novel drugs or ligands, in view of the
conservation of some basic mechanisms between yeast and human
cells
✓ This conservation makes some yeast genes important for the study of
human genetics

S. cerevisiae genes related to human disease genes

S. cerevisiae genes related to nucleotide excision repair (NER) genes

S. pombe genes related to human disease genes

S. pombe genes related to human cancer genes

Figure: Comparison of homologous genes from different species

Figure: Orthologs in different species

Figure: Comparison of proteins in S. pombe (S.p.), S. cerevisiae (S.c.) and C. elegans (C.e.)
(a) Pie chart comparing the homology of proteins of S. pombe with those of S. cerevisiae and
C. elegans; (b) Pie chart comparing the homology of proteins of S. cerevisiae with those of
S. pombe and C. elegans

S. cerevisiae had a sequence approximately 60 times
larger than any sequence previously attempted,
which indicates why Goffeau felt compelled to invite the
cooperation of a group of laboratories

At the time the sequencing of model organisms such
as S. cerevisiae appeared to be the logical step
towards the eventual characterization of the human
genome, a task that seemed beyond the scope of
technology due to its tremendous size of 3,000 Mb

Thank-you…
By:
Nazish Nehal,
M. Tech (Biotechnology),
University School of Biotechnology (USBT),
Guru Gobind Singh Indraprastha University,
New Delhi (INDIA)
