Genome sequencing

Molecular Genetics
Genome
Sequencing

At a glance
 What is a genome
 Types of genomes
 What is genomics
 How is genomics different from genetics
 Types of genomics
 Genome sequencing
 Milestones in genomic sequencing
 Technical foundations of genomics
 Steps of genome sequencing
 DNA sequencing approaches
 Hierarchical shotgun sequencing
Markers used in mapping large genomes
 Whole genome shotgun sequencing
 New technologies
Genome sequencing achievement in Bangladesh
Benefits of Genome Research

WHAT IS A GENOME?
 Genome: One complete set of genetic
information (total amount of DNA) from a haploid set of
chromosomes of a single cell in eukaryotes, in a single
chromosome in bacteria, or in the DNA or RNA of viruses.
 Basic set of chromosomes in an organism.
“The whole hereditary information of an organism that is
encoded in the DNA”
• In cytogenetics, genome means a single set of chromosomes.
• It is denoted by x. The number of genomes in an organism depends on its
ploidy level.
• In Drosophila melanogaster (2n = 2x = 8); genome x = 4.
• In hexaploid Triticum aestivum (2n = 6x = 42); genome x = 7.
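The relation between 2n, the ploidy level and x in the two examples above can be checked with a short calculation. This is only a minimal sketch; the function name `basic_chromosome_number` is an illustrative assumption and does not come from any library.

```python
def basic_chromosome_number(somatic_count, ploidy):
    """Return x, the number of chromosomes in one genome (one basic set).

    somatic_count is 2n; ploidy is the number of genomes per somatic cell.
    """
    if somatic_count % ploidy != 0:
        raise ValueError("2n must be a whole multiple of the ploidy level")
    return somatic_count // ploidy


# Drosophila melanogaster: 2n = 2x = 8
drosophila_x = basic_chromosome_number(8, 2)
# Triticum aestivum (hexaploid): 2n = 6x = 42
wheat_x = basic_chromosome_number(42, 6)
print(drosophila_x, wheat_x)
```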

The genome is found inside every cell, and in cells that have a nucleus,
the genome is situated inside the nucleus. Specifically, it is all the
DNA in an organism.
 The term genome was introduced by H. Winkler in 1920 to
denote the complete set of chromosomal and extra
chromosomal genes present in an organism, including a virus.

How many types of genomes are there:
1. Prokaryotic Genomes
2. Eukaryotic Genomes
• Nuclear Genomes
• Mitochondrial Genomes
• Chloroplast Genomes
If not specified, “genome” usually refers to the nuclear genome.
WHAT IS GENOMICS?
• Genomics is the study of the structure and function of
whole genomes.
• Genomics is the comprehensive study of whole sets of
genes and their interactions rather than single genes or
proteins.
• According to T.H. Roderick, genomics is the mapping and
sequencing of genomes to analyze their structure and
organization.

Origin of terminology
• The term genome was used by German botanist Hans
Winkler in 1920
• Collection of genes in haploid set of chromosomes
• Now it encompasses all DNA in a cell
Genomics is the subdiscipline of molecular genetics
devoted to the study of whole genomes.
 The field includes studies of intra-genomic phenomena
such as heterosis, epistasis, pleiotropy and other interactions
between loci and alleles within the genome.

 The sequence information of the genome will
show:
 The position of every gene along the chromosome,
 The regulatory regions that flank each gene, and
 The coding sequence that determines the protein
produced by each gene.
 How is Genomics different from Genetics?
Genetics is the study of inheritance, and genomics is the
study of genomes.
– Genetics looks at single genes, one at a time, like a
picture or snapshot.
– Genomics looks at the big picture and examines all the
genes as an entire system.

Types of Genomics
1. Structural: It deals with the determination of the
complete sequence of genomes and the gene map.
This has progressed in steps as follows:
(i) construction of high-resolution genetic and physical
maps,
(ii) sequencing of the genome, and
(iii) determination of the complete set of proteins in an
organism.
2. Functional: It refers to the study of the functioning of
genes, their regulation and their products (metabolic
pathways), i.e., the gene expression patterns in an organism.
3. Comparative: It compares genes from different genomes
to elucidate functional and evolutionary relationships.

Genome Sequencing
Genome sequencing is the technique that allows
researchers to read the genetic information found in the DNA of
anything from bacteria to plants to animals. Sequencing involves
determining the order of the bases, the nucleotide subunits
adenine (A), guanine (G), cytosine (C) and thymine (T), found in
DNA.
Genome sequencing is figuring out the order of DNA nucleotides.
Challenges of genome sequencing
 Data are produced in the form of short reads, which have to be assembled
correctly into large contigs and chromosomes.
 The short reads contain low-quality bases and vector/adaptor
contamination.
 Several genome assemblers are available, but their performance has to be
compared to find the best one.
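The read clean-up problem listed above (adaptor contamination and low-quality bases) can be illustrated with a small sketch. The adaptor string, the quality threshold of 20 and the function name `trim_read` are illustrative assumptions, not the settings of any particular sequencing platform or trimming tool.

```python
def trim_read(bases, quals, adapter, min_q=20):
    """Remove adaptor sequence and low-quality 3' bases from one read.

    bases: read sequence; quals: one integer quality score per base.
    Everything from the first adaptor occurrence onward is discarded,
    then bases below min_q are trimmed from the 3' end.
    """
    cut = bases.find(adapter)
    if cut != -1:
        bases, quals = bases[:cut], quals[:cut]
    end = len(bases)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return bases[:end], quals[:end]


# Toy read: 8 genomic bases followed by a hypothetical adaptor fragment.
read = "ACGTACGT" + "AGATCGG"
scores = [30] * 6 + [10, 5] + [30] * 7
clean_bases, clean_scores = trim_read(read, scores, adapter="AGATCGG")
print(clean_bases)
```

Real pipelines also handle partial adaptor matches and sequencing errors inside the adaptor, which this sketch deliberately ignores.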

Milestones in Genomic Sequencing
1977; Fred Sanger; φX174 bacteriophage (first sequenced genome);
5,375 bp
Amino acid sequence of phage proteins
Overlapping genes only in viruses
Fig: The genetic map of phage φX174 (overlapping reading frames)

1995; Craig Venter & Hamilton Smith;
Haemophilus influenzae (1,830,137 bp) (1st free living).
Mycoplasma genitalium (smallest free-living, 580,000 bp; 470 genes)
1996; Saccharomyces cerevisiae; (1st eukaryote) 12,068,000 bp
1997; Escherichia coli; 4,639,221 bp; Genetically more important.
1999; Human chromosome 22; 53,000,000 bp
2000; Drosophila melanogaster; 180,000,000 bp
2001; Human; Working draft; 3,200,000,000 bp
2002; Plasmodium falciparum; 23,000,000 bp
Anopheles gambiae; 278,000,000 bp
Mus musculus; 2,500,000,000 bp
2003; Human; finished sequence, 3,200,000,000 bp
2005; Oryza sativa (first cereal grain); 489,000,000 bp
2006; Populus trichocarpa (first tree) ; 485,000,000 bp

Technical foundations of genomics
 Molecular biology: Almost all of the
underlying techniques of genomics
originated with recombinant-DNA
technology.
 DNA sequencing: In the genome projects
described here, almost all DNA sequencing
was performed using the approach
pioneered by Sanger.
 Library construction: Also essential to
high-throughput sequencing is the ability
to generate libraries of genomic clones
and then cut portions of these clones and
introduce them into other vectors.
 PCR amplification: The use of the
polymerase chain reaction (PCR) to
amplify DNA, developed in the 1980s, is
another technique at the core of
genomics approaches.
Fig: Plot of log MW against distance
 Hybridization techniques: Finally, hybridization of one nucleic acid to
another is used to detect and quantitate DNA and RNA (for example, Southern
blotting). This method remains the basis for genomics techniques such as
microarrays.

Steps of genome sequencing
 Break genome into smaller fragments
 Sequence those smaller pieces
 Piece the sequences of the short fragments together
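The three steps above (fragment, sequence, piece together) can be sketched on a toy sequence. This is a simplified greedy overlap merger, assumed here purely for illustration; it is not the assembler used by any genome project, it ignores sequencing errors, and it would be misled by repeats. The 25-base "genome" and the function names are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left; remaining reads stay separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for n, r in enumerate(reads) if n not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Step 1: break a toy genome into overlapping 10-base fragments.
genome = "ACGTTGCAAGCTTAGGCATCCGATA"
reads = [genome[s:s + 10] for s in range(0, 16, 5)]
# Steps 2 and 3: the fragments are "read" as-is, then pieced together.
contigs = greedy_assemble(reads)
print(contigs)
```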
DNA sequencing approaches
Two different methods are used:
1. Hierarchical shotgun sequencing
-Useful for sequencing genomes of higher vertebrates
that contain repetitive sequences
2. Whole genome Shotgun Sequencing
-Useful for smaller genomes

Hierarchical Shotgun Sequencing
• The method preferred by the Human Genome Project is
the hierarchical shotgun sequencing method.
• Also known as
– The Clone-by-Clone Strategy
– the map-based method
– map first, sequence later
– top-down sequencing
Human Genome Project adopted a map-based strategy
– Start with well-defined physical map
– Produce shortest tiling path for large-insert clones
– Assemble the sequence for each clone
– Then assemble the entire sequence, based on the physical
map

In the Clone-by-Clone Strategy
1) Markers for regions of the genome are identified.
2) The genome is split into large fragments (50-200 kb), each containing a
known marker, using restriction enzymes.
3) These fragments are cloned in bacteria (E. coli) using BACs (Bacterial
Artificial Chromosomes), where they are replicated and stored.
4) The BAC inserts are isolated and the whole genome is mapped by
finding markers regularly spaced along each chromosome to determine the
order of the clones.
5) The fragments contained in these clones have different ends, so with
enough coverage a scaffold of overlapping BACs (a BAC contig) can be found.
The set of BACs in the contig that covers the entire genomic area of interest
makes up the tiling path.
6) Each BAC in the tiling path (also called the golden path) is fragmented
randomly into smaller pieces, and these fragments are individually sequenced
on both strands using automated Sanger sequencing.
7) These sequences are aligned so that identical sequences overlap.
Assembly of the genome is based on prior knowledge of the markers, which
localize the sequenced fragments to their genomic location. A computer
stitches the sequences together using the markers as a reference guide.

Fig: Hierarchical shotgun sequencing
In this approach, every part
of the genome is actually
sequenced roughly 4-5
times to ensure that no
part of the genome is left
out.
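The idea of sequencing every region 4-5 times can be made quantitative with a rough calculation. The sketch below assumes an idealized model in which reads land uniformly at random, so that the expected fraction of bases missed at coverage c is about e^(-c); this model and the function names are assumptions added for illustration and are not taken from the slides.

```python
import math


def coverage(num_reads, read_length, target_size):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / target_size


def expected_uncovered_fraction(c):
    """Expected fraction of bases with no read, for uniformly random reads."""
    return math.exp(-c)


# Example: a 150,000 bp BAC insert sequenced with 1,500 reads of 500 bases.
c = coverage(1500, 500, 150_000)
missed = expected_uncovered_fraction(c)
print(c, missed)
```

Under this model, raising the coverage from 4 to 5 lowers the expected missed fraction from roughly 1.8% to roughly 0.7%.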

Each 150,000 bp fragment is inserted into a BAC (bacterial artificial
chromosome). A BAC can replicate inside a bacterial cell. A set of BACs
containing an entire human genome is called a BAC library.
The Clone-by-Clone Strategy was used for
S. cerevisiae (yeast),
C. elegans (nematode),
Arabidopsis thaliana (mustard weed),
Oryza sativa,
Drosophila melanogaster and
Homo sapiens (Human), etc.

The Clone-by-Clone Strategy
Markers used in mapping large genomes
Different types of markers are used in mapping large
genomes, such as:
A. Restriction Fragment Length Polymorphisms (RFLP)
B. Variable Number of Tandem Repeats (VNTRs)
C. Sequence Tagged Sites (STS)
D. Microsatellites, etc.

A. Restriction Fragment Length Polymorphisms (RFLP)
Polymorphism means that a genetic locus has different forms, or
alleles.
Cutting the DNA from any two individuals with a restriction
enzyme may yield fragments of different lengths, called Restriction
Fragment Length Polymorphisms (RFLPs). The abbreviation is usually
pronounced “rifflip”.
 The pattern of RFLP generated will depend mainly on
– 1) The differentiation in DNA of selected strains (or) species
– 2) The restriction enzymes used
– 3) The DNA probe employed for southern hybridization
Steps:
a. Consider the restriction enzyme HindIII, which recognizes the sequence
AAGCTT.
b. Of two individuals, one has three HindIII sites on a chromosome, so
cutting the DNA with HindIII yields two fragments, 2 and 4 kb long.

Figure: Detecting an RFLP
c. Another individual may lack the middle site but have the other two, so
cutting the DNA with HindIII yields one fragment 6 kb long. This
difference in fragment lengths is the RFLP.

d. These restriction fragments of different lengths between the genotypes
can be detected on Southern blots with a suitable probe. An RFLP is
detected as a difference in the position of a band between gel lanes
from different species or strains. Each such band is regarded as a
single RFLP locus, so any differences among the DNA of individuals are
easy to see.
e. This RFLP is used as a marker in chromosomal mapping.
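Steps a to e can be mimicked on toy sequences. The sketch below locates HindIII recognition sites (AAGCTT) and reports the distances between neighbouring sites; the two "individuals" are invented sequences scaled down so that 20, 40 and 60 bases stand in for the 2, 4 and 6 kb fragments of the example, and the function names are illustrative assumptions.

```python
HINDIII_SITE = "AAGCTT"


def site_positions(dna, site=HINDIII_SITE):
    """Start positions of every occurrence of the recognition site."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


def fragments_between_sites(dna, site=HINDIII_SITE):
    """Lengths of the fragments lying between neighbouring sites."""
    pos = site_positions(dna, site)
    return [b - a for a, b in zip(pos, pos[1:])]


# Individual 1 has three sites; individual 2 has lost the middle one
# through a single base change (AAGCTT -> AAGCTA).
individual_1 = HINDIII_SITE + "C" * 14 + HINDIII_SITE + "C" * 34 + HINDIII_SITE
individual_2 = HINDIII_SITE + "C" * 14 + "AAGCTA" + "C" * 34 + HINDIII_SITE

print(fragments_between_sites(individual_1))
print(fragments_between_sites(individual_2))
```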
Limitations
 Requires a relatively large amount of highly pure DNA
 Laborious and expensive to identify a suitable marker and restriction
enzyme.
 Time consuming.
 Requires expertise in autoradiography because radioactively labeled
probes are used

B. Variable Number of Tandem Repeats (VNTRs)
RFLPs with only a few alleles make mapping very tedious; in such cases
the highly polymorphic variable number tandem repeats (VNTRs) are
more useful.
Tandem repeats occur in DNA when a pattern of one or more nucleotides
is repeated and the repetitions are directly adjacent to each other.
An example would be:
ATTCGCCAATC ATTCGCCAATC ATTCGCCAATC
In which the sequence ATTCGCCAATC is repeated three times.
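The example can be checked by counting how many copies of the unit sit directly next to each other. This is a minimal sketch with an assumed function name; the flanking bases "GG" and "TT" are invented padding.

```python
def max_tandem_copies(seq, unit):
    """Largest number of directly adjacent copies of unit found in seq."""
    best = 0
    for i in range(len(seq) - len(unit) + 1):
        copies = 0
        j = i
        while seq.startswith(unit, j):
            copies += 1
            j += len(unit)
        best = max(best, copies)
    return best


unit = "ATTCGCCAATC"
locus = "GG" + unit * 3 + "TT"
print(max_tandem_copies(locus, unit))
```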
• A variable number tandem repeat (or VNTR) is a location in
a genome where a short nucleotide sequence is organized as a tandem
repeat.
• The repeated sequence is about 10-100 base pairs long.
• The full genetic profiles of individuals reveal many differences.
• Most human genes are the same from person to person, but
Variable Number of Tandem Repeats (VNTRs) tend to differ
among different people.

• While the repeated sequences themselves are usually the same from
person to person, the number of times they are repeated tends to vary.
• VNTRs are highly polymorphic. They can be isolated from an
individual’s DNA and are therefore relatively easy to map.
• However, VNTRs have a disadvantage as genetic markers: They tend
to bunch together at the ends of chromosomes, leaving the interiors of
the chromosomes relatively devoid of markers.

C. Sequence Tagged Sites (STS)
Another kind of genetic marker, which is very useful to genome mappers, is
the sequence-tagged site (STS).
• STSs are short sequences, about 60–1000 bp long, that can be easily
detected by PCR using specific primers.
• The sequences of small areas of this DNA must be known, so that
one can design primers that will hybridize to these regions and allow PCR
to produce double-stranded fragments of predictable lengths. If a fragment
of the proper size appears, then the DNA has the STS of interest.
• One great advantage of STSs as a mapping tool is that no DNA needs to be
cloned and examined.
•Instead, the sequences of the primers used to generate an STS are
published and then anyone in the world can order those same primers and
find the same STS in an experiment that takes just a few hours.

In this example, two PCR
primers (red) spaced 250 bp apart
have been used. Several cycles of
PCR generate many double-stranded
PCR products that are
precisely 250 bp long.
Electrophoresis of this product
allows one to measure its size
exactly and confirm that it is the
correct one.
Figure : Sequence-tagged sites
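Detecting an STS amounts to asking whether two primers bracket a product of the predicted length. The sketch below does this check in silico with exact string matching; the primer sequences and the template are invented for illustration, and real PCR tolerates mismatches that this sketch does not model.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string made of A, C, G and T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def pcr_product_size(template, forward, reverse):
    """Size of the product the primer pair would amplify, or None.

    forward matches the top strand; reverse anneals to the top strand,
    so its reverse complement is what appears in the template string.
    """
    start = template.find(forward)
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return end + len(reverse_site) - start


# Hypothetical 20-base primers placed so the product is 250 bp long.
forward_primer = "ACGTACGTTGCAGTCAGGTC"
reverse_primer = "TTGACCGTAGGCTAACGTCA"
template = ("TT" + forward_primer + "A" * 210
            + reverse_complement(reverse_primer) + "GG")

size = pcr_product_size(template, forward_primer, reverse_primer)
print(size)
```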

Making physical map using Sequence Tagged Sites (STS)
1. Geneticists interested in physically mapping or sequencing a given
region of a genome aim to assemble a set of clones called a contig,
which contains contiguous (actually overlapping) DNAs spanning long
distances.
2. It is essential to have vectors like BACs and YACs that hold big chunks
of DNA. Assuming we have a BAC library of the human genome, we
need some way to identify the clones that contain the region we want to
map.
3. A reliable method is to look for STSs in the BACs. It is best to
screen the BAC library for at least two STSs, spaced hundreds of kilobases
apart, so that BACs spanning a long distance are selected.
4. After we have found a number of positive BACs, we begin mapping by
screening them for several additional STSs, so we can line them up in
an overlapping fashion as shown in following figure. This set of
overlapping BACs is our new contig. We can now begin finer mapping,
and even sequencing, of the contig.

Fig: Mapping with STSs.
At top left, several representative BACs are shown, with different symbols representing different STSs placed at
specific intervals. In step (a) of the mapping procedure, screen for two or more widely spaced STSs. In this case
screen for STS1 and STS4. All those BACs with either STS1 or 4 are shown at top right. The identified STSs are shown
in color. In step (b), each of these positive BACs is further screened for the presence of STS2, STS3, and STS5. The
colored symbols on the BACs at bottom right denote the STSs detected in each BAC. In step (c), align the STSs in
each BAC to form the contig. Measuring the lengths of the BACs by pulsed-field gel electrophoresis helps to pin
down the spacing between pairs of BACs.
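Step (c), lining up BACs by the STSs they share, can be sketched as a small ordering problem. The sketch assumes the STSs are already numbered in their order along the chromosome (STS1 to STS5) and that each BAC carries a consecutive run of them; the BAC names and their STS content are hypothetical.

```python
def order_contig(bac_sts):
    """Order BACs left to right and check that neighbours overlap.

    bac_sts maps a BAC name to the set of STS numbers detected in it.
    """
    ordered = sorted(bac_sts, key=lambda b: (min(bac_sts[b]), max(bac_sts[b])))
    for left, right in zip(ordered, ordered[1:]):
        if not bac_sts[left] & bac_sts[right]:
            raise ValueError(f"gap in contig between {left} and {right}")
    return ordered


screened = {
    "BAC_C": {3, 4, 5},
    "BAC_A": {1, 2},
    "BAC_B": {2, 3, 4},
}
contig = order_contig(screened)
print(contig)
```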

D. Microsatellites
STSs are very useful in physical mapping or locating specific sequences in
the genome. But sometimes it is not possible to use them for genetic
mapping.
Fortunately, geneticists have discovered a class of STSs called
microsatellites.
GCTTGGTGTGATGTAGAAGGCGCCAATGCATCTCGACGTAT
GCGTATACGGGTTACCCCCTTTGCAATCAGTGCACACACAC
ACACACACACACACACACACACACACACACAGTGCCAAGCA
AAAATAACGCCAAGCAGAACGAAGACGTTCTCGAGAACACC
 Microsatellites are similar to minisatellites in that they consist of a core
sequence repeated over and over many times in a row.
 The core sequence in typical microsatellites is smaller—usually only 2–4
bp long.
 Microsatellites are highly polymorphic; they are also widespread and
relatively uniformly distributed in the human genome.
 The number of repeats varies quite a bit from one individual to another.
 Thus, they are ideal as markers for both linkage and physical mapping.

 In 1992, Jean Weissenbach et al produced a linkage map of the entire
human genome based on 814 microsatellites containing a C–A
dinucleotide repeat.
 The most common way to detect microsatellites is to design PCR primers
that are unique to one locus in the genome and that base-pair on
either side of the repeated portion.
 Therefore, a single pair of PCR primers will work for every individual in the
species and produce different-sized products for each of the different-length
microsatellites.
 The PCR products are then separated by gel electrophoresis, so the
investigator can determine the size of the PCR product and thus
how many times the dinucleotide ("CA") was repeated for each allele.
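The last step, going from PCR product size to repeat number, is simple arithmetic once the length of the non-repeat flanking sequence (primers plus unique sequence) is known. The 100 bp flank and the product sizes below are hypothetical values chosen for illustration.

```python
def repeat_count(product_size, flank_size, unit_size=2):
    """Number of repeat units in an allele, from its PCR product size."""
    repeat_bp = product_size - flank_size
    if repeat_bp < 0 or repeat_bp % unit_size != 0:
        raise ValueError("product size is inconsistent with flank and unit size")
    return repeat_bp // unit_size


# Two alleles of a CA-repeat locus with 100 bp of flanking sequence.
allele_1 = repeat_count(136, 100)
allele_2 = repeat_count(142, 100)
print(allele_1, allele_2)
```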

Whole genome Shotgun Sequencing
The shotgun-sequencing strategy, first proposed by Craig Venter,
Hamilton Smith, and Leroy Hood in 1996, bypasses the mapping stage and
goes right to the sequencing stage.
This method was employed by Celera Genomics, a private company that was
trying to monopolise the human genome sequence by patenting it. To do
this, it had to try to beat the publicly funded project, so it adopted
whole genome shotgun sequencing.
1. BAC library: A BAC library of random fragments of the human genome is
generated using restriction digestion followed by cloning.
The sequencing starts with a set of BAC clones containing very large
DNA inserts, averaging about 150 kb. The insert in each BAC is sequenced
on both ends using an automated sequencer that can usually read about 500
bases at a time, so 500 bases at each end of the clone will be determined.
Assuming that 300,000 clones of human DNA are sequenced this
way, that would generate 300 million bases of sequence, or about 10% of the
total human genome. These 500-base sequences serve as an identity tag,
called a sequence-tagged connector (STC), for each BAC clone. This is the
origin of the term connector: each clone should be “connected” via its STCs
to about 30 other clones.
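The figures quoted above can be reproduced with a few lines of arithmetic. The genome size of 3,200,000,000 bp is taken from the milestones list; the even spacing of STCs along the genome is a simplifying assumption made only for this estimate.

```python
GENOME_SIZE = 3_200_000_000   # human genome, bp
CLONES = 300_000              # BAC clones end-sequenced
READ_LENGTH = 500             # bases read from each end
INSERT_SIZE = 150_000         # average BAC insert, bp

# Two end reads per clone.
stc_bases = CLONES * 2 * READ_LENGTH
fraction_of_genome = stc_bases / GENOME_SIZE

# If the STCs were spread evenly, one would occur about every 5.3 kb,
# so a 150 kb insert would contain STCs from roughly 28 other clones.
stc_spacing = GENOME_SIZE / (CLONES * 2)
stcs_per_insert = INSERT_SIZE / stc_spacing

print(stc_bases, fraction_of_genome, stcs_per_insert)
```

The result of about 28 agrees with the "about 30 other clones" stated in the text.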

Steps:
1. BAC library
2. Finger printing
3. Plasmid library
4. BAC walking
5. Powerful computer
program
Fig: Whole Genome Shotgun
Sequencing Method

2. Fingerprinting: In this step each clone is fingerprinted by digesting it
with a restriction enzyme. This serves two important purposes. First, it tells
the insert size (the sum of the sizes of all the fragments generated by the
restriction enzyme). Second, it allows one to eliminate aberrant clones whose
fragmentation patterns do not fit the consensus of the overlapping clones.
Note that this clone fingerprinting is not the same as mapping; it is just a
simple check before sequencing begins.
3. Plasmid library: A seed BAC is selected for sequencing. The seed BAC is
subcloned into a plasmid vector by subdividing the BAC into smaller clones
of only about 2 kb. A plasmid library is prepared by transforming E. coli
strains with the plasmids.
4. BAC walking: Three thousand of the plasmid clones are sequenced, and
the sequences are ordered by their overlaps, producing the sequence of the
whole 150-kb BAC. This whole BAC sequence allows the identification of the
30 or so other BACs that overlap with the seed: they are the ones with STCs
that occur somewhere in the seed BAC. These BACs are then compared by
fingerprinting to find those with minimal overlaps, and those are sequenced.
This strategy, called BAC walking, would in principle allow one laboratory
to sequence the whole human genome.

5. Powerful computer program: Because BAC walking would take too long,
Venter and colleagues modified the procedure by sequencing BACs at
random until they had about 35 billion bp of sequence. In principle that should
cover the human genome ten times over, giving a high degree of coverage
and accuracy. Then they fed all the sequence into a computer with a
powerful program that found areas of overlap between clones and fit their
sequences together, building the sequence of the whole genome.

Finishing
• Process of assembling raw
sequence reads into
accurate contiguous
sequence
– Required to achieve
1/10,000 accuracy
• Manual process
– Look at sequence reads at
positions where programs
can’t tell which base is the
correct one
– Fill gaps
– Ensure adequate coverage
Fig: Assembled reads showing a gap and a single-stranded region
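The 1/10,000 accuracy target can be expressed on the logarithmic Phred quality scale, Q = -10 log10(p), where p is the probability that a base is wrong. The slides do not mention Phred scores; the conversion is added here as a commonly used convention, and the function names are illustrative.

```python
import math


def phred_quality(error_probability):
    """Phred-scaled quality for a given per-base error probability."""
    return -10 * math.log10(error_probability)


def error_probability(quality):
    """Per-base error probability for a given Phred quality."""
    return 10 ** (-quality / 10)


finished_q = phred_quality(1 / 10_000)
print(finished_q)
```

On this scale the finishing target of one error in 10,000 bases corresponds to a quality of 40.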

Finishing
• To fill gaps in sequence,
design primers and
sequence from primer
• To ensure adequate
coverage, find regions
where there is not
sufficient coverage and
use specific primers for
those areas
Fig: Primers placed on either side of a gap

Verification
• Region verified for the following:
– Coverage
– Sequence quality
– Contiguity
• Determine restriction-enzyme cleavage
sites
– Generate restriction map of sequenced region
– Must agree with fingerprint generated of clone
during mapping step
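The restriction-map check can be sketched as an in-silico digest of the finished sequence whose fragment sizes are compared with the clone fingerprint. The example uses EcoRI (G^AATTC) on an invented linear sequence; the 10% size tolerance, the fingerprint values and the function names are assumptions for illustration only.

```python
def digest_sizes(seq, site, cut_offset):
    """Sorted fragment sizes from cutting a linear sequence at every site.

    cut_offset is the position of the cut within the recognition site.
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return sorted(b - a for a, b in zip(bounds, bounds[1:]))


def agrees_with_fingerprint(predicted, observed, tolerance=0.10):
    """True if both lists have the same number of fragments of similar size."""
    if len(predicted) != len(observed):
        return False
    pairs = zip(sorted(predicted), sorted(observed))
    return all(abs(p - o) <= tolerance * o for p, o in pairs)


# Toy "finished" sequence containing two EcoRI sites.
sequence = "C" * 30 + "GAATTC" + "C" * 50 + "GAATTC" + "C" * 14
predicted = digest_sizes(sequence, "GAATTC", cut_offset=1)
print(predicted)
print(agrees_with_fingerprint(predicted, [20, 30, 57]))
```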

New technologies
• A high-priority goal at the beginning of the Human
Genome Project was to develop new mapping and
sequencing technologies
• To date, no major breakthrough technology has been
developed
– Possible exception: whole-genome shotgun sequencing applied
to large genomes, Celera
Automated sequencers
• Perhaps the most important contribution to large-scale
sequencing was the development of automated
sequencers
– Most use Sanger sequencing method
– Fluorescently labeled reaction products
– Capillary electrophoresis for separation

Automated sequencers: ABI 3700
Fig: MegaBACE and ABI 3700 automated sequencers (labelled parts: 96-well
plate, robotic arm and syringe, 96 glass capillaries, load bar)

Automatic gel reading
Computer image of
sequence read by
automated sequencer

Sequence assembly readout
Consensus building

Genome sequencing achievement in Bangladesh
• Genome sequencing of Macrophomina phaseolina
• Genome sequencing of Jute

Genome of destructive Pathogen
Macrophomina phaseolina unraveled
by Maqsudul Alam & BJRI Associates
 Macrophomina phaseolina is a soil and seed borne fungus.
 It can infect more than 500 cultivated and wild plant species.
 It causes seedling blight, dry root rot, wilt, leaf blight, stem blight,
root and stem rot of different cultivated and wild plant species.
 The fungus can remain viable for more than 4 years in soil and
crop.

• The Basic and Applied Research on Jute (BARJ) project team, led
by Prof Maqsudul Alam, took this unique challenge and, for the first
time in the world, decoded the genome of this most dangerous
fungus.
• They have identified the proteins and their networks that the fungus
uses to attack and kill the plant. This fundamental knowledge will help to
defend and fight against this fungus and to promote the development of
resistant varieties of jute as well as other crops.

Genome sequencing of Tossa jute (Corchorus olitorius)
• Jute was called the Golden Fiber of Bangladesh, as
Bangladesh was the largest jute-producing country in the
world.
• The genome of jute has been sequenced by
Bangladeshi scientists.

• Bangladesh was the first country in the world to decode the jute genome.
• The research team was led by Professor Maqsudul Alam from the University of
Hawaii, who also successfully led the genome sequencing of papaya in the USA
and rubber in Malaysia.
• The team also included
 a group of Bangladeshi researchers from Dhaka
University's Biochemistry and Biotechnology
departments,
 Bangladesh Jute Research Institute (BJRI),
 software firm Data Soft, in collaboration with the Centre
for Chemical Biology, University of Science, Malaysia, and
 University of Hawaii.
This was done under the
Basic & Applied Research on Jute
Project (BARJ).

Fig: Internationally famed geneticist Maqsudul Alam and other scientists of
the jute genome project

Anticipated Benefits of Genome Research
Molecular Medicine
• improve diagnosis of disease
• detect genetic predispositions to disease
• create drugs based on molecular information
Microbial Genomics
• rapidly detect and treat pathogens (disease-causing microbes) in
clinical practice
• develop new energy sources (biofuels)
• monitor environments to detect pollutants
• clean up toxic waste safely and efficiently.
Risk Assessment
• evaluate the health risks faced by individuals who may be exposed to
radiation and to cancer-causing chemicals and toxins
Bio-archaeology, Anthropology, Evolution
• study evolution through mutations in lineages
• study migration of different population groups based on maternal
inheritance
• compare breakpoints in the evolution of mutations with ages of
populations and historical events.
Agriculture, Livestock Breeding, and Bio-processing
• grow disease-, insect-, and drought-resistant crops
• breed healthier, more productive, disease-resistant farm animals
• grow more nutritious produce
• develop biopesticides
• incorporate edible vaccines into food products
DNA Identification (Forensics)
• identify potential suspects whose DNA may match evidence left at
crime scenes
• identify crime victims
• establish paternity and other family relationships
• identify endangered and protected species as an aid to wildlife officials
• detect bacteria and other organisms that may pollute air, water, soil,
and food
• match organ donors with recipients in transplant programs

