© 2004: The BioTeam
http://bioteam.net
cdwan@bioteam.net
Genomic Biology and
Bioinformatics
The BioTeam
BioTeam™ Inc.
• Objective & vendor neutral informatics and ‘bio-IT’ consulting
• Composed of scientists who learned to bridge the gap between life
science informatics and high performance IT
• “iNquiry” bioinformatics cluster solution
• Staff
Michael Athanas Bill Van Etten
Chris Dagdigian Stan Gloss
Chris Dwan
http://bioteam.net
Goal of this session
• Introduce major concepts in genetics, genomics, and
bioinformatics.
• Provide a minimal vocabulary to enable communication.
• Enable communication between the disciplines
Please ask questions
Outline
• Genetics to Genomics
• Data formats & Resources
• Sequence Analysis
Goals
• Build shared vocabulary, global view
• Introduce online and text resources
• Build interest
Not:
• Teaching molecular biology
• Teaching bioinformatics
Motivation for this session
• Bioinformatics will be the major new application
domain for High Performance Computing (HPC)
applications over the next 50 years.
• Life Scientists will walk into the computing
center, wanting to work with you (or you will
walk into their lab…)
• No need to repeat old mistakes.
What is Bioinformatics?
• http://bioinformatics.org/faq/#definitions
– Computational Biology
– Systems Biology
– Genetics
– Biology
– *-omics
• The application of high performance computing and data
handling techniques to life sciences research
• A major revenue stream, with lots of hype
Genome Sizes (in base pairs)
• HIV (type 1) HIV 9,750
• Escherichia coli E. coli 4x10^6
• Saccharomyces cerevisiae yeast 10^7
• Oryza sativa rice 10^8
• Arabidopsis thaliana “mouse-ear cress” 10^8
• Drosophila melanogaster Fruit Fly 1.8x10^8
• Bos taurus Cow 3x10^9
• Homo sapiens Human 3x10^9
• Zea mays corn 5x10^9
• Pinus resinosa Pine 7x10^10
• Amoeba dubia amoeba 6.7x10^11
In context (Jan, 2004)
• Complete genomes: ~800
• 19 eukaryotic
• 16 archaea
• 64 bacteria
• The rest: Viruses
• Eukaryotes with at least one sequence in GenBank:
• Between 50,000 and 100,000
• Distinct Species
• 1.4x10^6 uniquely named species
• ~10^7 distinct species
Genome Sizes (in base pairs)
What else could be
bioinformatics?
• Fold / Structure / Docking / Function predictions on proteins
and bioactive molecules
• Ontology building / literature searches / text mining /
knowledge management
• Image processing to support lab automation / data capture /
experiment steering
• Medical records integration with proteomic / transcript
studies
• Expert systems / AI / Clinical / Lab assistant
• Virtual organizations, distributed databases, ad hoc expert
conversations…
Suffixes
• “ology”:
• Biology, Physiology, Embryology, Terminology
• Homology? Homo = same; logy = origin
• “ics”:
• Physics, Linguistics, Statistics, Bioinformatics
• “ome”:
• Proteome, Genome, Transcriptome,
• Chromosome? Chromo = color; soma = body;
• “ome-ics”:
• Proteomics, genomics
• Economics?
Topics in Genomics
• The Central Dogma
• Levels of structure and interaction
• The Chromosome Model
• DNA Sequencing
• Genome Assembly
• Transcripts and Expression Levels
• Protein Folding
• Protein Interaction
What I want you to remember
• Genotype vs. Phenotype
• The Chromosome Model
• The Central Dogma
• Levels of Structure (primary -> quaternary)
• Homology is boolean
• It’s more complicated than they will admit (at first)
• http://www.bioinformatics.org
• http://www.ncbi.nih.gov
Bioinformatics is Biology
Real Question (July 15, 2002)
“We have 10,000 BAC end reads from an organism
with massive synteny to a model organism. We
want to map markers from the model onto the
putative homologs in the BAC clones so that we can
do directed sequencing.”
Example Question
We have 10,000 BAC end reads from an organism with massive synteny to
a model organism. We want to map markers in the model onto the
putative homologs in the BAC clones so that we can do directed
sequencing.
• What is a BAC end read? How does it differ from a BAC clone?
• What is a Homolog? Given that, what is a “putative” one?
• What is “Synteny?” Is it different from homology?
• What is a model organism?
• What are “markers?”
How can I best help this person?
Real Question (May 30, 2002)
“Tell me all the kinases which have a valine or an
arginine within 2 angstroms of the active site.”
• What is a kinase?
• What are valine and arginine?
• What is an active site?
Why Put The Biology First?
“Bioinformatics is full of pitfalls for those who look for
patterns or make predictions without a thorough
understanding of where biological data comes from
and what it means”
Nevin Young PhD
Professor, University of Minnesota
A New Way of Thinking
• "The new paradigm, now emerging, is that all genes will
be known (in the sense of being resident in databases
available electronically), and that the starting point of a
biological investigation will be theoretical.”
- Walter Gilbert, 1993
speculating on the nature of biology in the "post-genome era"
Genetics to Genomics
• 1600’s: Europe emerges from the dark ages
• 1822 - 1884: Gregor Mendel
• 1920’s: Genetic Mapping (Morgan)
• 1952: DNA is Genetic Material (Hershey)
• 1953: DNA Helix (W & C, Franklin)
• 1966: Genetic Code (Nirenberg, Khorana)
• 1977: DNA Sequenced (Sanger)
• 1988: Human Genome Project Started
• 2001: Human Genome Draft Finished
Selective Breeding
Francesco Redi: 1626-1697
• Prevailing Theory “Spontaneous Generation”
– Meat makes maggots
– Straw makes mice
• Experiment:
– Meat in two jars, one open one sealed.
– Observe flies -> eggs -> maggots -> flies
– nothing happens to the closed jar meat
• Inference: Flies make flies.
• Confirmed by Pasteur in mid 1800’s
Science Marches On!
• 1651 - William Harvey
• Theory: “Ex Ovo Omnia” From the egg, everything! (No evidence whatsoever)
• 1827 - Karl Ernst von Baer
• First mammalian egg observed under a microscope. (dog)
• 1868 - Friedrich Miescher
• DNA (“Nuclein”) first observed. (Surgical bandages from soldiers)
• 1875 - Oscar Hertwig
• Observed that fertilization in both animals and plants consists of the physical union
of the two nuclei contributed by the male and female parents. (Sea Urchin)
• 1882: Walther Flemming
• Observed chromosomes by staining cells at mitosis (Salamander)
Gregor Mendel (1822-1884)
• Monk, Interested in math &
gardening
• Selectively bred pea plants
– 28,000 plants over 7 years
– 7 distinct phenotypic traits.
• Published: 1866
• First Cited: 1900
Why did Mendel succeed?
• Studied one characteristic at a time:
– Pea shape
– Internal color
– Seed-coat and flower color
– pod shape
– pod color
– flower position
– plant height
• Kept pedigrees and made several generations of crosses
• Kept track of numbers of progeny from each cross.
Mendel was really, really lucky.
Genotype vs. Phenotype
• Genotype:
– Properties (not necessarily observable) that can be passed on to
offspring
– DNA code and other genetic properties
• Phenotype:
– Observable traits of the organism
– Things we can see
Farmers have known this for a long time
Mendelian Genetics
• Genetic “factors” (genes) determine
phenotypic traits.
• Each organism has two instances
(alleles) of each gene.
• Segregation: One copy from each
parent (selected at random) is
passed on to each progeny.
• Independent assortment: Copies of
different genes are passed on
independently of one another.
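A minimal sketch of the rules on this slide, assuming a single gene with two alleles and simple dominance. The function names and the "A"/"a" allele labels are illustrative only.

```python
from collections import Counter
from itertools import product


def cross(parent1, parent2):
    """Count offspring genotypes for one gene, e.g. cross("Aa", "Aa")."""
    offspring = Counter()
    # Each parent passes on one of its two alleles, chosen at random,
    # so every pairing enumerated below is equally likely.
    for allele1, allele2 in product(parent1, parent2):
        # Sort so that "aA" and "Aa" are counted as the same genotype.
        offspring["".join(sorted(allele1 + allele2))] += 1
    return offspring


def phenotype_counts(genotypes, dominant="A"):
    """Return (dominant, recessive) counts, assuming simple dominance."""
    showing = sum(n for g, n in genotypes.items() if dominant in g)
    return showing, sum(genotypes.values()) - showing


print(cross("Aa", "Aa"))
print(phenotype_counts(cross("Aa", "Aa")))
```

Crossing two "Aa" parents gives the familiar three-to-one split between dominant and recessive phenotypes.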
Cell Division
• Mitosis:
• “Ordinary” cell division
• Start with 1 diploid cell
• End with 2 diploid cells
• No crossing over (or, if so, it
doesn’t matter)
• Meiosis:
• “Gametogenesis”
• Start with 1 diploid cell
• End with 4 haploid gamete
cells
• Crossing over occurs
(mechanism for recombination
between linked genes)
How was Mendel Lucky?
Mendel was lucky because:
• Peas are diploid
• The traits he studied were all far apart on the chromosomes
• He didn’t use a self fertilizing (or otherwise freakish) plant
Mendel was unlucky because:
• Despite being mostly correct, his paper was rejected by his
journal of choice
• He died before anyone discovered and cited his results
• People now think that he must have cleaned his data.
“The Chromosome Model”
With this model, we can
look at the entire range of
molecular biology, from
chromosomes to base
pairs.
This is not a mechanism
Chromosomes
• Chromo = color
• Soma = body
• Chromosomes:
– Colored (when stained) bodies that
appear in the cell at mitosis and
meiosis
– Appear in pairs, except in gamete
cells (sperm and ova), where they
are single.
– A good candidate for the location of
genes
Science Marches On!
• 1902: Walter Sutton
– Evidence that Mendel’s genetic factors exist on chromosomes
(grasshoppers)
Metaphase Spread Karyotype
Number of (different)
Chromosomes
Chimpanzee 48
Cabbage 18
Camel 74
Chicken 78
Cat 38
Dog 78
Human 46
Corn 20
Alligator 32
Chromosome Copies: “Ploidy”
“Number of copies of each chromosome”
• 2 = Diploid:
– Humans (and the majority of other eukaryotes)
• 4 = Tetraploid:
– Pine Trees
• 6 = Hexaploid:
– ??
• 8 = Octoploid:
– Starfish
Thomas Morgan (1866-1945)
• “The Fly Room”
– Breeding experiments on Drosophila
melanogaster (Columbia University)
• Alfred Sturtevant:
– First Chromosome Map
• Calvin Bridges:
– Chromosome theory of Heredity
• Hermann Muller:
– Mutations can be induced by X-ray
irradiation
Why Model Organisms?
• Fruitflies:
• Only eight chromosomes.
• Reproduce very quickly, with lots of offspring.
• Tiny, so they don't take up a lot of room in the lab.
• They don't need a whole lot of food to survive.
• More Recently:
• Small genome
• Easily transformed
• Numerous mutants
• Well funded research community
Some modern models
• Drosophila melanogaster
• Mus musculus
• Anopheles gambiae
• Arabidopsis thaliana
• Medicago truncatula
• Oryza sativa
• Glycine max
• Zea mays
Prokaryotes vs. Eukaryotes
• Viruses: (10^2 genes, 10^4 base pairs)
• Prokaryote: (10^3 genes, 10^6 base pairs)
• No Nucleus (Mostly bacteria)
• No Introns (genes read continuously)
• One circular chromosome
• Genes clumped together in “operons”
• Much simpler genetics. Also much harder to see.
• Eukaryote: (10^4 genes, 10^9 base pairs)
• Nucleated
• Introns (Genes have untranslated “stuff” stuck in them)
• Many, linear chromosomes
• Genes spread out all over the place
• Multi-cellular and therefore more interesting.
Chromosome Mapping
y^3 – 12y^2 + 2y + 4 = 0
Alfred Sturtevant was an
undergraduate working in Morgan’s
lab who (the story goes) set aside his
algebra homework one night to
create the first genetic map.
Crossing Over
Chromosome Mapping
• Linked Genes:
– Recombine less frequently than expected by Mendel’s law of
independent assortment
– Frequency of recombination ∝ distance
– Sturtevant called the unit of distance “map units”
– Frequently referred to as “centiMorgans” after Dr. Morgan
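A hedged sketch of the mapping idea: recombination frequency is converted to map units, and three linked genes are ordered by noting that the two genes farthest apart must be the outer pair. The function names are hypothetical and the distances in the usage line are made-up illustrative values, not data from the slides. The proportionality only holds well for short distances.

```python
def map_distance(recombinants, total_offspring):
    """Recombination frequency expressed in map units (centiMorgans)."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return 100.0 * recombinants / total_offspring


def order_three_genes(distances):
    """Order three linked genes from their pairwise map distances.

    `distances` maps frozenset({gene1, gene2}) to a distance in map units.
    The pair that is farthest apart must be the two outer genes.
    """
    outer = max(distances, key=distances.get)
    (middle,) = set().union(*distances) - outer
    first, last = sorted(outer)
    return first, middle, last


# Made-up distances for three hypothetical genes y, w and v.
example = {
    frozenset("yw"): 1.0,
    frozenset("wv"): 30.0,
    frozenset("yv"): 31.0,
}
print(order_three_genes(example))
```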
Crossing Over
A Genetic Map of Drosophila
Note that we’re
still not looking at
DNA sequences.
DNA is the Genetic Material
1943: Oswald Avery et al.
sacrifice mice to demonstrate
that DNA could be the material
for genes (to one part in
6x10^8).
1952: Alfred Hershey and
Martha Chase use viruses to
prove it.
“Perhaps we will be able to grind
genes in a mortar and cook
them in a beaker after all.”
-Hermann Muller
“At the time it was believed that DNA was a stupid
substance. A tetranucleotide which couldn’t do
anything specific.”
-Max Delbrück
Nobel Milestones
• 1953 - 3D Structure of DNA
– Watson & Crick - model
– Wilkins & Franklin -x-ray structure
– Nobel in 1962
1953: Watson & Crick Structure
• Nucleotides
– ‘A’ Adenine
– ‘G’ Guanine
– ‘C’ Cytosine
– ‘T’ Thymine
“It has not escaped our notice that
the specific pairing we have
postulated immediately suggests a
possible copying mechanism for the
genetic material.”
Watson & Crick, 1953
Deoxyribonucleic Acid
Chromosomes are long chains
of nucleotides in complementary
strands…
...AAACTGGAGCTCACCGCGGTGGCGGC...
...TTTGACCTCGAGTGGCGCCACCGCCG...
Complementary single strands
have strong affinity for each
other:
A pairs with T, G pairs with C.
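A small sketch of strand complementarity under the standard Watson-Crick pairing (A with T, G with C). The helper names are illustrative. Because the two strands run in opposite directions, the partner strand is usually written as the reverse complement.

```python
# Translation table implementing A<->T and G<->C, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Base-by-base partner strand, read in the same left-to-right order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Partner strand written in its own 5' to 3' direction."""
    return complement(seq)[::-1]


print(complement("AAACTGGAGCTCACCGCGGTGGCGGC"))
print(reverse_complement("GATC"))
```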
“The Chromosome Model”
With this model, we can
look at the entire range of
molecular biology, from
chromosomes to base
pairs.
Nobel Milestones
• 1959 – 3D Structure of a Protein
– Perutz & Kendrew
– structure of myoglobin & hemoglobin
– Nobel in 1962
Nobel Milestones
• 1970’s – Nucleic Acid Chemistry
– Paul Berg – recombinant DNA
– Gilbert & Sanger – sequencing
– Nobel in 1980
Sequencing
• First protein sequence (insulin)
published by Sanger, 1955; his DNA
sequencing method followed in 1977
• Generate all possible
subsequences from a fixed
5’ end (primer)
• Sort them by weight
• Read terminal nucleotide
Sanger Sequencing
…AGTCCTG
…AGTCCT
…AGTCC
…AGTC
…AGT
…AG
…A
G A T C
•DNA of all possible lengths from a
known starting point
•Each strand ends with a radioactive
“dideoxy” nucleotide which terminates
the chain
•The strands are “weighed” using gel
electrophoresis
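The ladder above can be mimicked in a few lines. This is a toy model, assuming one fragment per possible stopping point and perfect separation by length; it ignores the fact that the fragments are really copies made on a template strand. The function name is hypothetical.

```python
import random


def sanger_read(template, seed=0):
    """Recover a sequence from its chain-terminated fragments."""
    # One fragment for every possible stopping point after the primer.
    fragments = [template[:n] for n in range(1, len(template) + 1)]
    # In the tube the fragments are mixed together in no particular order.
    random.Random(seed).shuffle(fragments)
    # Gel electrophoresis separates them by length, shortest first.
    ladder = sorted(fragments, key=len)
    # The labeled terminal nucleotide of each rung spells out the sequence.
    return "".join(fragment[-1] for fragment in ladder)


print(sanger_read("AGTCCTG"))
```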
Modern Sequencing
• Accomplished in a single capillary tube
• Results read via a laser spectrometer
• Accurate to ~700bp
• Completely automated (~$0.04 / bp in 2003)
Sequence Data, Errors
• Error rates for a single read = 0.002
• One error per read sequence, on average
• Types of error:
• Rare - Misreads
• Common - Deletions / double-reads
• Insertion of sequence from the vector
• Contamination with human or E. coli DNA
• Quality tapers off at the end of a read
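A quick check of the arithmetic, assuming a read length of about 700 bp (the figure quoted on the "Modern Sequencing" slide) and independent, uniformly distributed errors. Real reads are worse toward the end, so this is only a rough sketch.

```python
def expected_errors(error_rate, read_length):
    """Average number of errors in one read."""
    return error_rate * read_length


def probability_error_free(error_rate, read_length):
    """Chance that a read contains no error at all (independent errors)."""
    return (1.0 - error_rate) ** read_length


print(expected_errors(0.002, 700))
print(probability_error_free(0.002, 700))
```

With these numbers the expectation is a little over one error per read, consistent with the slide, and only about a quarter of reads would be completely clean.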
Nucleotide Ambiguity Codes
A = Adenine G = Guanine T = Thymine C = Cytosine
R = A + G Y = C + T K = G + T
M = A + C S = C + G
W = A + T
V = A + C + G B = C + G + T
H = A + C + T
D = A + G + T
N = A + G + T + C
I = hypoxanthine
!(i/[GATCsn]+/)
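One common use of the ambiguity codes is pattern matching. The sketch below turns a pattern written with the codes in the table above into a regular expression; the dictionary and function names are illustrative, and the hypoxanthine code is left out for simplicity.

```python
import re

# Ambiguity code -> the set of bases it stands for (from the table above).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "V": "ACG", "B": "CGT", "H": "ACT", "D": "AGT",
    "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Expand each ambiguity code into a regex character class."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def matches(pattern, sequence):
    """True if the whole sequence is consistent with the ambiguous pattern."""
    return re.fullmatch(iupac_to_regex(pattern), sequence.upper()) is not None


print(iupac_to_regex("GAYN"))
print(matches("RY", "AC"))
```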
“The Chromosome Model”
With this model, we can
look at the entire range of
molecular biology, from
chromosomes to base
pairs.
Restriction Enzymes
• Cut DNA at a specific substring (different for each restriction enzyme)
…GGCTAGATTCCCTAGTTCGCTAATCGCT…
||||||||||||||||||||||||||||
…CCGATCTAAGGGATCAAGCGATTAGCGA…
Cut with “CTAGT” Restriction Enzyme
…GGCTAGATTCCCTAGT TCGCTAATCGCT…
 |||||||||||      ||||||||||||
…CCGATCTAAGG GATCAAGCGATTAGCGA…
Sticky Ends
Restriction Enzymes
• “Cut” DNA only at a substring specific to the
restriction enzyme.
• Statistically, these substrings will occur several
times along the length of a chromosome:
Chromosome
Cut Sites
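A toy digest, assuming a made-up enzyme that recognizes "CTAGT" on the top strand and cuts a fixed number of bases into the site. Only one strand is modeled, so sticky ends are not represented. The function names and the `offset` parameter are hypothetical.

```python
def cut_sites(sequence, site):
    """Start positions of every (possibly overlapping) occurrence of site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def digest(sequence, site, offset):
    """Cut `offset` bases into each recognition site; return the fragments."""
    cuts = [position + offset for position in cut_sites(sequence, site)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


top_strand = "GGCTAGATTCCCTAGTTCGCTAATCGCT"
print(cut_sites(top_strand, "CTAGT"))
print(digest(top_strand, "CTAGT", offset=5))
```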
Vectors
• Circular pieces of DNA with a cut site
• Used to capture pieces of DNA
Insertion site
…GGCTAGATTCCCTAGT TCGCTAATCGCT…
 |||||||||||      ||||||||||||
…CCGATCTAAGG GATCAAGCGATTAGCGA…
Sticky Ends
Modern vectors
Many possible
Insertion sites
Gene coding for a brightly
colored protein so we can
visually distinguish vectors with
inserts from those without
Gene conveying
resistance to ampicillin
Making Insert Libraries
• Separate out DNA from target organism
• Use PCR to make lots of copies of the DNA
• Cut with restriction enzymes, with vectors present in solution
• Place vectors into E. coli cells
• Spread vectorized E. coli onto agar plates
• Let grow overnight on medium with ampicillin
• Transfer only non-blue colonies into multi well plates (96 or 384).
• Sequence all the wells.
• What do you get after all this fun?
Thousands of “clone libraries” in a freezer somewhere
Sizes of Insert Libraries
• Phage Library:
• 5 - 3,000bp
• Bacterial Artificial Chromosome (BAC):
• 80,000 - 100,000 bp
• Yeast Artificial Chromosome (YAC):
• 150,000 - 200,000 bp
Restriction Enzymes
Restriction Fragments
• By controlling the relative amounts of DNA and
restriction enzyme, we can produce a large set of
smaller chromosome fragments
BAC End Sequences
Restriction Fragments
• It is “easy” to read the 700bp at each end of the
insert libraries
How Many Fragments?
• For a 5 letter (5-mer) restriction enzyme, odds of randomly
hitting the target sequence are approximately:
(1/4)^5 = 1/1024 ≈ 10^-3
• If a genome of interest is about 3x10^9 bp this gives us
approximately:
3x10^6 segments
• Using 3 or 4 unrealistic assumptions….
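The estimate can be reproduced directly. The sketch assumes the same unrealistic conditions as the slide: all four bases equally likely, positions independent, and every site cut. Function names are illustrative.

```python
def site_probability(site_length, alphabet_size=4):
    """Chance that a given position starts the recognition site."""
    return (1.0 / alphabet_size) ** site_length


def expected_cut_sites(genome_length, site_length):
    """Expected number of sites, ignoring edge effects."""
    return genome_length * site_probability(site_length)


print(site_probability(5))
print(expected_cut_sites(3e9, 5))
```

The exact figure for a 3x10^9 bp genome is a little under three million sites, which the slide rounds to 3x10^6.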
Genome Sequencing: BAC Tiling
• Directed BAC Sequencing
– Read all BAC Ends & Fingerprints
– Create the minimal tiling path to cover each chromosome
– Sequence each BAC using smaller insert libraries (but the
same basic idea)
– Close Gaps (primer walking)
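Choosing a minimal tiling path can be sketched as a classic interval-cover problem, assuming each BAC has already been placed at known start and end coordinates (in practice those placements come from end reads and fingerprints and are themselves uncertain). The greedy rule below, always taking the clone that reaches farthest among those overlapping the covered region, is one standard approach and not necessarily what any sequencing center used. Clone names and coordinates are made up.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Pick the fewest clones covering [region_start, region_end).

    `clones` is a list of (name, start, end) tuples.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Among clones that start inside the covered region,
        # keep the one that extends farthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in coverage at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path


clones = [
    ("A", 0, 100), ("B", 50, 180), ("C", 90, 150),
    ("D", 170, 300), ("E", 120, 260), ("F", 250, 400),
]
print(minimal_tiling_path(clones, 0, 400))
```

A `ValueError` here corresponds to a gap that would need to be closed by other means, such as primer walking.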
Directed BAC Sequencing
Minimum Tiling Path
Shotgun Sequencing
• Use inserts of approximately 1,000bp
• No pre-processing or ordering, use computational
techniques to assemble larger and larger fragments
• Entirely automated
• Works a lot better if someone else is doing BAC
sequencing in the public domain
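To give a feel for the computational side, here is a toy greedy assembler: it repeatedly merges the two reads with the longest suffix-to-prefix overlap. It assumes error-free reads from one strand and no repeats, none of which holds for real data, and production assemblers are far more sophisticated. All names and the example reads are illustrative.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_length=3):
    """Merge reads pairwise by best overlap until nothing more overlaps."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_length)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # remaining contigs do not overlap
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["AGCTTCGA", "AGTCCTGA", "CTGAAGCT"]
print(greedy_assemble(reads))
```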
Finishing a Genome
• Sequence ought to be derived from a mixture of anonymous
individuals
• Hard to finish regions:
– Telomere
– Centromere
– Highly variable regions
• 10x coverage, 99% assembly
• Standards vary by community.
We have a genome, now what?
• Where are the genes?
• How are genes controlled / activated?
• Can we add to / subtract from the genome?
• Why is there all that extra “junk” in there?
• What genes are common between organisms?
Topics in Genomics
• The Central Dogma
• Levels of structure and interaction
• The Chromosome Model
• DNA Sequencing
• Genome Assembly
• Transcripts and Gene Expression
• Protein Folding
• Protein Interaction
Central Dogma
DNA
•Four Bases:
•GATC
•Double Stranded
•G pairs with C
•A pairs with T
•Packaged in
Chromosomes
RNA
•T->U
•Single Stranded
•Mechanism for
differential gene
expression
Amino Acid
Chains
•20 amino acids
•“Genetic Code”
translates 3 RNA bases (one codon)
to 1 amino acid
Transcription Translation
All disciplines should have the guts to admit
to having a “central dogma”
Levels of Structure
• Primary: Sequence
• Secondary: Local properties
• Hydrophobic / hydrophilic regions.
• α-helices and β-sheets
• Tertiary: 3-d structure
• Quaternary: Interaction
• Protein-protein interactions
• Post-translational modification
• Enzymatic action
• $$$$
What is a “gene?”
• “The fundamental unit of genetic inheritance”
• “One gene, one transcript”
• One gene, one splice variant
• “One gene, one protein”
• “One gene, one heritable trait”
Nobel Milestones
• 1960’s – Genetic Code
• Holley, Khorana and Nirenberg
• Rosetta Stone of Life
• Nobel in 1968
The Genetic Code
Gamow and the Genetic Code
Transcription & Translation
…GATC…
…CTAG…DNA
…GAUC…mRNA
Amino Acid Chain
Transcription
Translation (in one of six possible
“Reading Frames”)
…RIDVLKGEKALKASGLVP…
Protein
Folding
Anthrax Toxin Delivery Factor
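The two steps can be sketched in code, assuming the standard genetic code and working from the coding strand of the DNA. The codon table is built from the conventional TCAG ordering; the function names are illustrative. Unrecognized codons (for example ones containing N) are reported as X.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def transcribe(dna):
    """Coding-strand DNA to mRNA: same sequence with U in place of T."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate DNA codon by codon, starting at position 0 ('*' = stop)."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translations in three offsets on each of the two strands."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (forward, reverse) for offset in range(3)]


print(transcribe("GATC"))
print(six_frames("ATGGCCTAA"))
```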
Eukaryotic genes contain Introns
But wait, there’s more
“Promoter”
TATA
“Start”
ATG
“Stop”
TAA
mRNA
Splicing
RNA
DNA
Introns (non
coding regions)
are removed
AAA(A100+)
Poly-A tail is attached
Open Reading Frame (ORF)
Six reading frames are possible
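A minimal open reading frame scan, assuming the simplest possible definition: an ATG followed in the same frame by the first stop codon. It looks at the three forward frames only (run it on the reverse complement for the other three) and knows nothing about introns or promoters. The function name and `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ORFs in the three forward frames.

    Each ORF runs from an ATG to the first in-frame stop codon, inclusive.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


sequence = "CCATGAAATGACC"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```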
Expression Level
• “What protein is being made / which gene is being
turned on when <your question here>?”
• Can approximate this with mRNA levels.
– Translation does not occur at a fixed rate
– Proteins degrade at radically different rates
– Some mRNA is never translated
Expressed Sequence Tags
1. Select organism to study
2. Chop up organism into “libraries”
representing interesting tissues,
developmental stages, or experimental
conditions.
3. Extract and sequence as many cDNAs
as possible from each library.
4. Compare sequences to determine:
• Tissue specific gene expression
• Hypothetical functions for proteins
• Expression levels (relative concentration in
cytoplasm)
Expressed Sequence Tags
Cell
2. Use Reverse Transcriptase
(poly-T primer) to create
cDNA
AAAAA(A100+)
1. Use Enzymes to digest DNA
& Proteins, leaving mRNA
TTTTT(T)
4. Sequence (via a complex
procedure omitted here for the
sake of brevity) the cDNA.
3. Capture the cDNA strand in
vector and incorporate into E. coli
cells to replicate.
EST Data
Reads of the same cDNA (product
of the same gene) produce an
assortment of sequences sharing
the Poly-A 3’, and extending a
random distance toward the 5’
end.
Issues:
• Sequence contamination with E. coli, or vector
• Spurious groupings of cDNA from different genes containing
similar regions
• Omission of genes due to low concentration or lack of
expression (solve with additional libraries)
ESTs are Popular
• Human: 4x10^9 sequences
• Mouse: 2x10^9 sequences
• Medicago truncatula: 1.6x10^6
• Read only the genes which are being expressed
• Get crude information about expression levels based on
frequency of a certain sequence.
• If a genome sequence is available, can locate genes on
chromosomes using similarity search
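The "crude information about expression levels" amounts to counting. A sketch, assuming each EST has already been assigned to a gene (the hard part, done by clustering or similarity search); the gene labels are made up.

```python
from collections import Counter


def relative_expression(est_gene_ids):
    """Fraction of ESTs in a library that were assigned to each gene."""
    counts = Counter(est_gene_ids)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


library = ["gene1", "gene1", "gene2", "gene1"]
print(relative_expression(library))
```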
Southern Blot
• Affix “target” single stranded sequence to a nylon membrane
• Label “probe” single stranded sequences (mRNA from cells) with a
fluorescent dye
• Wash probe over target
• Similar sequences will hybridize (stick together)
• Check for fluorescence
Target
Probe
Fluorescent
Label
Micro / Macroarrays
• Stick (hybridize) single stranded DNA
to some surface (glass slide or nylon
membrane)
• Attach fluorescent markers to the
single stranded “probe” control
sample
• Attach a different frequency of
fluorescent marker to experimentally
stressed probe sequences
• Wash probes over targets. (like will
stick to like)
• Illuminate with laser and record
differential frequency response
Microarray Data
Gene Chips (2003)
• 20bp sequences built using
photolithography
• Sequence must be known in
advance
• $200-$500 per “chip” from
Affymetrix (and others)
• Tools for data analysis also
available for $$
Microarrays vs Gene Chips
• Microarrays
• Cheap to create
• No need to know sequences ahead of time (just use sample that
is already in the freezer)
• Gene Chips
• Initially expensive to create
• All target sequences already known
• “The mouse chip.” “The human chip”
Time Course Experiments
• At t=0, 5, 10, … from start of condition x
• What genes are up and down regulated?
• What gene clusters seem to move together?
Quality of Microarray data
• Spot location
• Spot size
• Differential Hybridization
• Errors in “swishing” of the probes
• In general, only differences of 1σ (one standard deviation) and above are
significant.
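One hedged reading of this rule of thumb, assuming the threshold is one standard deviation of the log2 intensity ratios across all spots on the array. Real analyses add normalization and replicate handling; the function names and the intensities in the example are made up.

```python
import math
import statistics


def log_ratios(control, treated):
    """log2(treated / control) for each spot."""
    return [math.log2(t / c) for c, t in zip(control, treated)]


def significant_spots(control, treated, n_sigma=1.0):
    """Indices of spots whose log ratio is at least n_sigma from the mean."""
    ratios = log_ratios(control, treated)
    mean = statistics.mean(ratios)
    sigma = statistics.stdev(ratios)
    return [i for i, r in enumerate(ratios) if abs(r - mean) >= n_sigma * sigma]


control = [100, 100, 100, 100, 100, 100]
treated = [100, 100, 100, 100, 100, 800]
print(significant_spots(control, treated))
```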
Aspects of Protein Structure
1 XMNFSGKYQV QSQENFEPFM KAMGLPEDLI QKGKDIKGVS EIVHEGKKVK
51 LTITYGSKVI HNEFTLGEEX ELETMTGEKV KAVVKMEGDN KMVTTFKGIK
101 SVTEFNGDTI TNTMTLGDIV YKRVSKRI
Amino Acid Codes
Alanine Ala A Arginine Arg R
Asparagine Asn N Aspartic Acid Asp D
Cysteine Cys C Glutamic Acid Glu E
Glutamine Gln Q Glycine Gly G
Histidine His H Isoleucine Ile I
Leucine Leu L Lysine Lys K
Methionine Met M Phenylalanine Phe F
Proline Pro P Serine Ser S
Threonine Thr T Tryptophan Trp W
Tyrosine Tyr Y Valine Val V
Glutamic Acid or Glutamine: Z
Any or Unknown Amino Acid: X
A bit more about Alanine
Molecular Structure
CH3-CH(NH2)-COOH
Molecular formula
C3H7NO2
Molecular weight:
89.09
Isoelectric point (pH):
6.00
CAS Registry Number:
56-41-7
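The molecular weight follows from the molecular formula. A small check, assuming standard atomic masses rounded to three decimals and a formula without parentheses; the function name is illustrative.

```python
import re

# Standard atomic masses (g/mol), rounded to three decimals.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def molecular_weight(formula):
    """Sum atomic masses for a simple formula such as 'C3H7NO2'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[element] * (int(count) if count else 1)
    return total


print(round(molecular_weight("C3H7NO2"), 2))
```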
Protein Structure is Difficult
• There is, presently, no high throughput solution to
determining protein structure
• Crystal structure with X-Ray Crystallography
• MALDI-TOF
• Computational Techniques (not mature beyond secondary
structure)
Dangers of Protein Structures
• If DNA sequences are cartoons…
• Protein structures are even less than that.
– Crystalline form (non biologically active)
– Low temperature
– No interactions with other molecules
Massively parallel biology
• Sequencing:
– Large centers produce multiple megabases per day, run
24 by 7
• Expression:
– Microarrays: 100,000 “spots” in parallel.
– 1 µm diameter
– Read with scanning laser
– Petabytes of image data soon
Why the Explosion?
http://www.sanger.ac.uk/Info/IT/
More…
• Proteomics
• Metabolomics
• Single Nucleotide Polymorphism (SNP)
• …
• Biochemical pathway analysis
• Protein - protein interaction
• …
• “Systems Biology”
Sequence Based
Bioinformatics
“The Chromosome Model”
With this model, we can
look at the entire range of
molecular biology, from
chromosomes to base
pairs.
This is not a mechanism
Levels of Structure (review)
• Primary: Sequence
• Secondary: Local properties
• Hydrophobic / hydrophilic regions.
• α-helices and β-sheets
• Tertiary: 3-d structure
• Quaternary: Interaction
• Protein-protein interactions
• Post-translational modification
• Enzymatic action
Homology is evolutionary relation
• Homolog:
– Related by descent.
– This is a boolean property: it is either true or false
• Can Occur Via:
– Duplication within a genome
– Separation by descent.
Other Terms
• Synteny:
– Genes share ordering between species
• Ortholog: Related by speciation
• Paralog: Related by duplication
• Wet lab: Bubbling vats of goo
• Dry lab: Whirring fans
Comparative Genomics
Phylogenetic Reconstruction
Chromosome scale
rearrangements
Remarkable similarity between
mouse and human chromosomes.
But what does this picture mean?
And how would we go about
computing it?
•Traditional gene maps?
•Markers?
•Sequence similarity?
•A combination of the wet and dry
lab?
Genetic Database Collaboration
• NCBI
– National Center for Biotechnology Information
– GenBank
– http://www.ncbi.nlm.nih.gov
• EBI
– European Bioinformatics Institute
– EMBL - European Molecular Biology Laboratory
– http://www.ebi.ac.uk
• CIB
– Center for Information Biology
– DDBJ - DNA Data Bank of Japan
– http://www.ddbj.nig.ac.jp
International Collaboration
NCBI CIB
EBI
GenBank DNA Data Bank of Japan
EMBL Nucleotide Sequence Database
Data are synchronized nightly between the three centers
National Center for Biotechnology Information
• The genetic sequence database of the US National Institutes of Health
• International Nucleotide Sequence Database Collaboration:
– DNA DataBank of Japan (DDBJ)
– European Molecular Biology Laboratory (EMBL)
– GenBank
• 2x10^10 bases in 1.7x10^7 sequences
• Release every two months, daily updates
http://www.ncbi.nih.gov
Sequence Data Sets at NCBI
• ‘NT’
• Nucleotide sequence dataset.
• Quality standards include 7x read, 1x reverse
• ‘NR’
• Non-redundant (cough cough…)
• amino acid sequence dataset
• ‘EST’
• Expressed Sequence Tag data
• Low quality, different sort of data
Transitive Catastrophe
• Sequences of low quality are annotated by similarity
to other sequences of low quality
• This can build a corpus of erroneous data
• Which will then be used to generate statistical
models and faster algorithms
• Which will be used to mis-annotate exponentially
increasing volumes of data
More Sequence Data Sets
• Protein Data Bank (PDB):
• Amino acid sequences for which a structure
has been experimentally determined
• SwissProt:
• Amino acid sequences with a high level of
annotation
• Genomes:
• All shapes and sizes
Entrez (at NCBI)
• PubMed: the biomedical literature
• Nucleotide sequence database (GenBank)
• Protein sequence database
• Structure: three-dimensional macromolecular structures
• Genome: complete genome assemblies
• PopSet: population study data sets
• OMIM: Online Mendelian Inheritance in Man
• Taxonomy: organisms in GenBank
• Books: online books
• ProbeSet: gene expression and microarray datasets
• 3D Domains: domains from Entrez Structure
• UniSTS: markers and mapping data
• SNP: single nucleotide polymorphisms
• CDD: conserved domains
Protein Structure Databases
• PDB - Protein DataBank
– Established in 1971 for protein structures
– http://www.pdb.org
– Now also includes nucleic acids, carbohydrates
Protein Sequence Databases
• PIR - Protein Information Resource
– Protein Sequence Database (PIR-PSD)
– Established in 1984
– http://pir.georgetown.edu/
Year   Amino Acid Residues   Sequence Records
1984   526,466               2,676
2001   76,174,552            219,241
Protein Sequence Databases
• SWISS-PROT
– Established in 1986
– http://www.expasy.org/sprot/
– Try to distinguish themselves by
• Annotation
• Minimal redundancy
• Integration with other databases
More Data Resources
• The Institute for Genomic Research (TIGR)
– http://www.tigr.org
• European Molecular Biology Laboratory (EMBL)
– http://www.embl.org
• European Bioinformatics Institute (EBI)
– http://www.ebi.ac.uk
• SwissProt, TrEMBL:
– http://www.expasy.ch
Ensembl
• EBI’s integrative genome data toolkit.
• A web based tool in which data from various
sources are associated with chromosome maps and
locations.
• http://www.ensembl.org
Distributed Annotation System (DAS)
• Client / Server system for publishing annotations to
chromosomal data.
• http://www.biodas.org
• BioMOBY: Web Services genome annotation
framework
Protein Structures
• SCOP: “Structural Classification of Proteins”
– Superfamily
– Family
– Fold
• CASP
– Competition for protein structure prediction programs
– Results are still lacking.
Data types & Formats
FASTA Format
>gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC class I
heavy chain, partial cds, clone MP-5.10m
AGGTATTTCCACACCGCCGTGTCTCGGCCCGGCCTCCGGGAGCCCCTCTTTATC
ACCGTCGGCTACGTGGACGACACGCAGTTCGTGCGGTTCGACAGCGACGCCCGGG
ATCCGAGGAAAGAACCACGGCAGCCGTGGATGGAGAAGGAGGGGCCGGAGTATT
GGGATCGCGAGACTCAAATCTCCAAGGAAAACGCACTGAAGTACCGAGAGGCCT
TGAACATCCTGCGCGGCTACTACAACCAGAGCGAGGCCGGGTCTCACACCTATCA
GCGGATGTACGGCTGCGACGTGGGGCCGGACGGGCGCCTCCTCAGCGGGTTCAC
GCAGTTCGGCTACGACGGCAGAGATTACATCGCCCTGAACGAGGACCTGCGCTC
CTGGACCGCGGCGGACACGGCGGCTCAGATCACCAAGCGCAAGTGGGAGGCGGCC
GGTGAGGCGGAGAGATTCAGGAACTACGTGGAGGGCCGGTGCGTGGAGTGGCTC
CGCAGATACCTG
FASTA Format
• Definition line:
• Required
• starts with ‘>’
• contains no line breaks
• Non-printing characters are frowned upon, but don’t break most tools.
Ctrl-A is used by some organizations to combine deflines in Unigene sets
• Data:
• Unlimited nucleotide or amino acid sequence, possibly filled with
whitespace and carriage returns.
• Capitalization does not matter (unless it does)
• FASTA files can (sometimes) be concatenated.
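The rules on this slide fit in a very small parser. The Python below is a minimal sketch, not the reader used by any particular toolkit; the function name `read_fasta` and the two sample records are made up for illustration. It treats any line starting with '>' as a definition line and removes whitespace from the sequence data.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {definition line: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    defline = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            defline = line[1:]
            records[defline] = ""
        elif defline is None:
            raise ValueError("sequence data found before any '>' definition line")
        else:
            # Remove embedded whitespace; keep case untouched, because some
            # tools use lower case to mark masked regions.
            records[defline] += "".join(line.split())
    return records


# Hypothetical two-record file, concatenated the way FASTA files often are.
example = """>seq1 hypothetical example record
ACGT ACGT
acgt
>seq2 another record
TTGACA
"""
records = read_fasta(example.splitlines())
print(records["seq1 hypothetical example record"])
```

Keying the dictionary on the whole definition line is a simplification: two records with identical deflines would overwrite each other, which is one reason concatenation only works "sometimes".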
GenBank Entry
LOCUS       AB008577     501 bp    mRNA    linear   MAM 22-JAN-1999
DEFINITION  Bos taurus mRNA for MHC class I heavy chain, partial cds, clone
            MP-5.10m.
ACCESSION   AB008577
VERSION     AB008577.1  GI:4165369
KEYWORDS    MHC class I heavy chain.
SOURCE      Bos taurus (variety:Holstein, isolate:MP-5) cultured T cells
            cDNA to mRNA, clone:MP-5.10m.
  ORGANISM  Bos taurus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
            Euteleostomi; Mammalia; Eutheria; Cetartiodactyla; Ruminantia; Pecora;
            Bovoidea; Bovidae; Bovinae; Bos.
REFERENCE   1  (bases 1 to 501)
  AUTHORS   Urakawa,T., Kodama,M., Morita,M. and Ikeda,H.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-NOV-1997) Toyohiko Urakawa, STAFF Institute, 2nd Division;
            446-1 Ippaizuka, Kamiyokoba, Tsukuba, Ibaraki 305, Japan
            (E-mail:urakawa@gene.staff.or.jp, Tel:+81-298-38-7757, Fax:+81-298-38-7880)
Fun facts about GenBank
• Accession:
• Unique ID for this sequence: AB008577
• Version:
• Incremented each time the sequence data changes: AB008577.1
• GI:
• Older numeric identifier that predates accession.version: GI:4165369
• Taxonomy ID:
• Link into NCBI’s Taxonomy tree
Only original authors can update data
GenBank Entry
FEATURES             Location/Qualifiers
                     /organism="Bos taurus"
                     /variety="Holstein"
                     /isolate="MP-5"
                     /db_xref="taxon:9913"
                     /clone="MP-5.10m"
                     /cell_type="cultured T cells"
                     /note="BoLA class I haplotype (A8A14/A6A19);
                     Common E group; RT-PCR amplified clone"
     CDS             <1..>501
                     /standard_name="MHC class I related gene"
                     /note="particial alpha 1and 2 domains"
                     /codon_start=1
                     /product="MHC class I heavy chain"
                     /protein_id="BAA37151.1"
                     /db_xref="GI:4165370"
                     /translation="RYFHTAVSRPGLREPLFITVGYVDDTQFVRFDSDARDPRKEPRQ
                     PWMEKEGPEYWDRETQISKENALKYREALNILRGYYNQSEAGSHTYQRMYGCDVGPDG
                     RLLSGFTQFGYDGRDYIALNEDLRSWTAADTAAQITKRKWEAAGEAERFRNYVEGRCV
                     EWLRRYL"
GenBank Entry
BASE COUNT 105 a 148 c 173 g 75 t
ORIGIN
1 aggtatttcc acaccgccgt gtctcggccc ggcctccggg agcccctctt tatcaccgtc
61 ggctacgtgg acgacacgca gttcgtgcgg ttcgacagcg acgcccggga tccgaggaaa
121 gaaccacggc agccgtggat ggagaaggag gggccggagt attgggatcg cgagactcaa
181 atctccaagg aaaacgcact gaagtaccga gaggccttga acatcctgcg cggctactac
241 aaccagagcg aggccgggtc tcacacctat cagcggatgt acggctgcga cgtggggccg
301 gacgggcgcc tcctcagcgg gttcacgcag ttcggctacg acggcagaga ttacatcgcc
361 ctgaacgagg acctgcgctc ctggaccgcg gcggacacgg cggctcagat caccaagcgc
421 aagtgggagg cggccggtga ggcggagaga ttcaggaact acgtggaggg ccggtgcgtg
481 gagtggctcc gcagatacct g
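The ORIGIN block is simple to turn back into a plain sequence: drop the leading coordinate on each line and join the ten-base groups. The Python below is a sketch under that assumption; `origin_to_sequence` is a hypothetical helper, and only the first two lines of the entry above are used as input.

```python
from collections import Counter
from typing import Iterable


def origin_to_sequence(lines: Iterable[str]) -> str:
    """Join the base groups of GenBank ORIGIN lines, dropping coordinates."""
    parts = []
    for line in lines:
        fields = line.split()
        # Sequence lines start with a numeric coordinate; skip anything else
        # (for example the 'ORIGIN' keyword itself or the closing '//').
        if not fields or not fields[0].isdigit():
            continue
        parts.extend(fields[1:])
    return "".join(parts)


origin_lines = [
    "ORIGIN",
    "        1 aggtatttcc acaccgccgt gtctcggccc ggcctccggg agcccctctt tatcaccgtc",
    "       61 ggctacgtgg acgacacgca gttcgtgcgg ttcgacagcg acgcccggga tccgaggaaa",
]
sequence = origin_to_sequence(origin_lines)
base_count = Counter(sequence)  # same idea as the BASE COUNT line
print(len(sequence), dict(base_count))
```

Run over the complete ORIGIN block, the same counter should reproduce the BASE COUNT line, which makes a cheap integrity check after a download.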
Ways to access data at NCBI
• http://www.ncbi.nih.gov
• Use Entrez to define custom sets of sequences and
download them in batch
• ftp.ncbi.nih.gov:/blast/db
• Download the entire 15GB set of datasets
• http://www.bioperl.org
• Perl routines for automating small data retrieval jobs.
NCBI Supported Formats
• ASCII GenBank Record
• FASTA
• ASN.1
• XML
More file formats
• Chromatogram:
• Binary output of an automated sequencer
• Phd / phred / quality file:
• ASCII file combining bases and quality values.
• ASN.1:
• Binary representation of GenBank entries
• C and C++ libraries for accessing ASN.1 are maintained by NCBI
Sequence Handling Tasks
• Base calling
– Chromatogram -> FASTA
• Sequence Cleaning
– Search for contamination
– Vector
– host DNA
– other common sequencing artifacts.
• Contig Assembly
• Genomic Assembly
Unigene Sets
• Contigging:
– In EST projects, cDNA reads which are believed to originate from the same
mRNA transcript are associated into contiguous segments.
– Sets of these contigged (consensus) sequences are sometimes called
“Unigene Sets.”
– Programs for doing this include:
• phrap
• TIGR Assembler
• Consed
• Arachne
Genomic Assembly
• Genomic Assembly:
– A time and labor intensive process by which gaps in the genomic
sequence are identified, primer pairs are constructed to target those
gaps, and additional sequencing is performed.
– There is no general solution to this, nor will there be.
Microarray Analysis
• Data Management:
– GeneSpring and others: Web front end to an annotation
database for microarray information
• Analysis:
– Normalization
– Synthetic experiment design
Biochemical Pathway Analysis
• Kyoto Encyclopedia of
Genes and Genomes
• http://www.genome.jp/kegg/
Sequence Analysis
Sequence Analysis Overview
• Properties of individual sequences
• Sequence alignment
• Alignment based search (BLAST)
• Multiple Sequence Alignment
• Motifs / etc.
• Statistical models / model based search
Amino Acid Properties
Similar Amino Acids
Tyrosine (Y) Phenylalanine (F)
Similar Amino Acids
Aspartate (D) Glutamate (E)
Examples from EMBOSS
• Pepstats
• Charge
• Compseq
• Pepwindow
Comparing Sequences
>NXCI_115_B04_F 544 0 544 ABI
GTGGTAAAACTGGAGCTCACCGCGGTGGCGGCCGCTCT
ANAACTAGTGGATCCCCCGGGCTGCAGGAATTCGGCAC
GAGATTTTGACAGACATGAGCTCATATGCAGATGCTTT
GCGTGAAGTGTCTGCAGCTCGTGAAGAAGTGCCTGGCC
GACGTGGTTATCCTGGGTACATGTATACTGACTTGGCA
ACGATTTATGAACGGGCAGGACGTATTGAAGGCCGAAA
AGGCTCTATTACTCAGATTCCCATTCTGACCATGCCCA
ATGATGATATTACACACCCAATTCCAGATCTAACAGGT
TACATCACAGAAGGGCAGATATATATTGACAGGCAACT
TCATATCGACAGATATACCCACCAATCAATGTTCTTCC
ATCTCTATCACGATTGATGAAGAGTGCTATAGGGGAGG
GAATGACTCGACGGGATCATGCTGAAGTTTCAAATCAG
CTATAGCAAATTATGCAATTGGAAAGGATGTACAAGCA
ATGAAGGCTGTGGTTGGAGAGGAGGCCTTGTCATCAGA
GGATCTGCTG
>gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC
class I heavy chain, partial cds, clone MP-5.10m
AGGTATTTCCACACCGCCGTGTCTCGGCCCGGCCTCCGGGAG
CCCCTCTTTATCACCGTCGGCTACGTGGACGACACGCAGTTCG
TGCGGTTCGACAGCGACGCCCGGGATCCGAGGAAAGAACCAC
GGCAGCCGTGGATGGAGAAGGAGGGGCCGGAGTATTGGGATC
GCGAGACTCAAATCTCCAAGGAAAACGCACTGAAGTACCGAG
AGGCCTTGAACATCCTGCGCGGCTACTACAACCAGAGCGAGGC
CGGGTCTCACACCTATCAGCGGATGTACGGCTGCGACGTGGG
GCCGGACGGGCGCCTCCTCAGCGGGTTCACGCAGTTCGGCTA
CGACGGCAGAGATTACATCGCCCTGAACGAGGACCTGCGCTC
CTGGACCGCGGCGGACACGGCGGCTCAGATCACCAAGCGCAAG
TGGGAGGCGGCCGGTGAGGCGGAGAGATTCAGGAACTACGTGG
AGGGCCGGTGCGTGGAGTGGCTCCGCAGATACCTG
Sequence Alignment
HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL
HBB_HUMAN GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
HBA_HUMAN GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
++ ++++H+ KV + +A ++ +L+ L+++H+ K
LGB2_LUPLU NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG
HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
GS+ + G + +D L ++ H+ D+ A +AL D ++AH+
F11G11.2 GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE
Sequence Alignment, Fact 1
“In biomolecular sequences (DNA, RNA, or amino
acid sequences), high sequence similarity
usually implies significant functional or
structural similarity.”
Dan Gusfield
Algorithms on Strings, Trees, and Sequences. 1997. Cambridge
University Press. p.212.
Sequence Alignment, Fact 2
“Evolutionarily and functionally related molecular strings can
differ significantly throughout much of the string and yet
preserve the same three-dimensional structure(s), or the
same two-dimensional substructure(s) (motifs, domains), or
the same active sites, or the same or related dispersed
residues (DNA or amino acid).”
Dan Gusfield.
Algorithms on Strings, Trees, and Sequences. 1997. Cambridge
University Press. p.334
Sequence Alignment
• Why do sequences appear similar?
– common ancestry
– common function
– chance
• Terms
– Identity - identical matches
– Similarity - common properties
– Homolog - common ancestor (related by descent)
• Paralog - related by duplication (often same species, different copy / function)
• Ortholog - related by speciation (often same function, different species)
Doolittle’s Twilight Zone
• Point at which two
sequences may appear
to be related based
only on random chance
Dottup Example
Sequence Alignment
• Aligning two sequences:
– Insert a minimum number of gaps into one or both
sequences to maximize matches
DDLMLSPDDLAQWLTEDPGPSEAPRMSE
|||:| | |:  ::    ||||| |:|
DDLLL-PQDVEEFF---EGPSEALRVSG
Sequence Alignments
• Matches may be identical
• Matches may include similar but not identical
properties
DDLMLSPDDLAQWLTEDPGPSEAPRMSE
|||:| | |:  ::    ||||| |:|
DDLLL-PQDVEEFF---EGPSEALRVSG
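The two kinds of match can be counted directly from an aligned pair. The Python below is an illustrative sketch: `alignment_stats` is a hypothetical helper, and the set of "similar" residue pairs is simply read off the ':' marks in the alignment above rather than taken from a substitution matrix.

```python
from typing import FrozenSet, Tuple


def alignment_stats(a: str, b: str,
                    similar_pairs: FrozenSet[FrozenSet[str]] = frozenset()
                    ) -> Tuple[int, int, int]:
    """Return (identical, similar, gap) column counts for two aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = gaps = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            identical += 1
        elif frozenset((x, y)) in similar_pairs:
            similar += 1
    return identical, similar, gaps


# Pairs marked ':' on the slide; an assumption standing in for a real matrix.
SIMILAR = frozenset(frozenset(p) for p in ("ML", "LV", "WF", "LF", "MV"))

top = "DDLMLSPDDLAQWLTEDPGPSEAPRMSE"
bottom = "DDLLL-PQDVEEFF---EGPSEALRVSG"
identical, similar, gaps = alignment_stats(top, bottom, SIMILAR)
print(identical, similar, gaps)
```

In practice the "similar" decision comes from a substitution matrix (for example, any pair with a positive score) instead of a hand-written set.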
Evolution of String Comparison
• Hamming Distance (1950): The number of locations at which the two (binary)
strings of equal length differ.
• Levenshtein Distance (1965): The number of single character insertions,
deletions, or substitutions (edits) required to transform one sequence into
another.
• “Substitution Matrices” (Dayhoff, 1978): Use of a Substitution Matrix to
encode log likelihoods of substitutions.
• “Gapped Alignments” (Many authors, 1980+): Mathematical models for
allowing gaps in alignments
• “Statistical Models” (Many authors, 1982+): No longer aligning against a
specific string, but against the compiled statistics of sets of strings.
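Levenshtein distance is the first step in this list that needs dynamic programming. The Python below is a standard two-row formulation, offered as a sketch; the function name is an arbitrary choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

Every edit costs 1 here. Substitution matrices and gap models, the next items in the list, replace those unit costs with biologically motivated ones.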
Hamming Distance (1950’s)
Count of the differences between two sequences of identical
length
53/55 identical
ctggagctcaccgcggtggcggccgctcta
||||||||||||||||||||||||||||||
ctggagctcaccgcggtggcggccgctcta
49/55 identical
gtaaagcccaccgcggtggcggccgctcta
 |  ||| ||||||||||||||||||||||
ctggagctcaccgcggtggcggccgctcta
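Counting Hamming distance is a one-line loop. The sketch below uses only the 30 bases visible in the second example above, so its count covers that visible portion and is not expected to match the per-55 figure quoted on the slide.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for equal lengths")
    return sum(x != y for x, y in zip(a, b))


first = "gtaaagcccaccgcggtggcggccgctcta"
second = "ctggagctcaccgcggtggcggccgctcta"
differences = hamming(first, second)
print(f"{len(first) - differences}/{len(first)} identical")
```

The equal-length requirement is the main limitation: a single inserted base shifts every later position and makes two near-identical sequences look unrelated.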
Substitution Matrixes
• Margaret Dayhoff (1925-1983):
• “Point Accepted Mutation” (PAM) 1978
• Substitution frequencies from “real” alignments of known
homologs, normalized to some percent mutation rate.
• 1300 sequences, 72 families, closely related within
families
• $\mathrm{PAM}_{ij} = 10 \log_{10} R_{ij}$
• $R_{ij} = \mathrm{freq}(i \to j) / \mathrm{freq}(i)$
• $\mathrm{PAM}_n = (\mathrm{PAM}_1)^n$
PAM 250
A R N D C Q E G H I L K M F P S T W Y V B Z X *
A 2 -2 0 0 -2 0 0 1 -1 -1 -2 -1 -1 -3 1 1 1 -6 -3 0 0 0 0 -8
R -2 6 0 -1 -4 1 -1 -3 2 -2 -3 3 0 -4 0 0 -1 2 -4 -2 -1 0 -1 -8
N 0 0 2 2 -4 1 1 0 2 -2 -3 1 -2 -3 0 1 0 -4 -2 -2 2 1 0 -8
D 0 -1 2 4 -5 2 3 1 1 -2 -4 0 -3 -6 -1 0 0 -7 -4 -2 3 3 -1 -8
C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3 0 -2 -8 0 -2 -4 -5 -3 -8
Q 0 1 1 2 -5 4 2 -1 3 -2 -2 1 -1 -5 0 -1 -1 -5 -4 -2 1 3 -1 -8
E 0 -1 1 3 -5 2 4 0 1 -2 -3 0 -2 -5 -1 0 0 -7 -4 -2 3 3 -1 -8
G 1 -3 0 1 -3 -1 0 5 -2 -3 -4 -2 -3 -5 0 1 0 -7 -5 -1 0 0 -1 -8
H -1 2 2 1 -3 3 1 -2 6 -2 -2 0 -2 -2 0 -1 -1 -3 0 -2 1 2 -1 -8
I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 2 -2 2 1 -2 -1 0 -5 -1 4 -2 -2 -1 -8
L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 -3 4 2 -3 -3 -2 -2 -1 2 -3 -3 -1 -8
K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 0 -5 -1 0 0 -3 -4 -2 1 0 -1 -8
M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 0 -2 -2 -1 -4 -2 2 -2 -2 -1 -8
F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 -5 -3 -3 0 7 -1 -4 -5 -2 -8
P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 1 0 -6 -5 -1 -1 0 -1 -8
S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 1 -2 -3 -1 0 0 0 -8
T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 -5 -3 0 0 -1 0 -8
W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 0 -6 -5 -6 -4 -8
Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 -2 -3 -4 -2 -8
V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 -2 -2 -1 -8
B 0 -1 2 3 -4 1 3 0 1 -2 -3 1 -2 -4 -1 0 0 -5 -3 -2 3 2 -1 -8
Z 0 0 1 3 -5 3 3 0 2 -2 -3 0 -2 -5 0 0 -1 -6 -4 -2 2 3 -1 -8
X 0 -1 0 -1 -3 -1 -1 -1 -1 -1 -1 -1 -1 -2 -1 0 0 -4 -2 -1 -1 -1 -1 -8
* -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 1
BLOcks Substitution Matrix (BLOSUM)
• Steven and Jorja Henikoff, 1992
• Calculated frequency of substitutions in
conserved motifs, rather than across the
global alignments.
Scoring gapped alignments
• Fixed cost to open a gap
• Weighted (affine) cost to increase an existing gap.
• Models biological events better than a fixed cost
• To score one alignment:
– Sum substitution scores and gap costs.
• To find the best possible alignment:
– Calculate score for all possible alignments
– Pick the best one.
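Scoring one alignment is a single pass over its columns. The Python below is a sketch with made-up parameters: a flat match/mismatch score stands in for a substitution matrix, and the gap costs (-5 to open, -1 to extend) are illustrative values, not defaults of any program.

```python
def score_alignment(a: str, b: str, match: int = 2, mismatch: int = -1,
                    gap_open: int = -5, gap_extend: int = -1) -> int:
    """Sum substitution scores and affine gap costs over aligned columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # The first column of a gap pays the opening cost, later ones
            # the cheaper extension cost. Simplification: a gap that switches
            # from one sequence to the other is treated as one gap.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


print(score_alignment("ACGTTTGCA", "ACG---GCA"))
```

Here six matches contribute 12 and the three-column gap costs 7, for a total of 5. Enumerating every possible alignment this way is infeasible for real sequences, which is why dynamic programming is used to find the best one.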
Global Alignments
• May miss conserved domains/motifs
Local Alignments
• Good for finding short similar regions
(eg protein domains, motifs)
Optimal Alignments
• Needleman & Wunsch and Smith-Waterman
• Exhaustive Search
• Alignment you get will have the best possible score
• Others may have the same score, but none better
• All pairs of sequences have an optimal alignment, whether or not they
are meaningful
• Slow
Smith, Waterman (1981)
• Finds highest scoring region in common
• Uses a “Dynamic Programming” algorithm
• Compute time grows with the square of the length of the
sequences
• Example: Is ELVIS in the SEVENELEVEN?
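The ELVIS question can be answered with a few lines of dynamic programming. The Python below is a sketch of the Smith-Waterman recurrence with a linear gap penalty; the scores (match +2, mismatch -1, gap -1) are arbitrary illustrative choices, and only the best score is returned, without traceback.

```python
def smith_waterman_score(query: str, target: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between query and target."""
    rows, cols = len(query) + 1, len(target) + 1
    # h[i][j] = best score of a local alignment ending at query[i-1], target[j-1]
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = h[i - 1][j - 1] + (
                match if query[i - 1] == target[j - 1] else mismatch)
            h[i][j] = max(0,                 # start a fresh local alignment
                          diagonal,          # align two residues
                          h[i - 1][j] + gap,  # gap in target
                          h[i][j - 1] + gap)  # gap in query
            best = max(best, h[i][j])
    return best


print(smith_waterman_score("ELVIS", "SEVENELEVEN"))
```

With these scores the best local alignment pairs EL-V with ELEV (three matches, one gap), scoring 5. The two nested loops are where the quadratic compute time comes from.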
Pairwise Alignment Search
• Needleman & Wunsch (1970):
– Dynamic programming applied to global alignments
• Smith & Waterman (1981):
– Dynamic programming applied to “Local Dayhoff matrix alignments”
• Pearson et al. (1988): FASTA
• Altschul et al. (1990): BLAST
– Heuristic approximations to Smith & Waterman allowing “reasonable” performance.
• Altschul et al. (1997): Gapped BLAST
– Further improvements to the BLAST algorithm
Suboptimal Alignments
• Take shortcuts for sake of speed
• FASTA (Global or Local)
– Pearson and Lipman (1988)
• BLAST - Basic Local Alignment Search Tool
– Altschul, Gish, Miller, Myers and Lipman (1990)
– 10-100 times faster than regular Smith-Waterman
– Less accurate
– Today’s gold standard for searching large databases
Why is alignment search complex?
• Perfect String Matching:
• Linear with length of strings
• … with gaps:
• Polynomial (~1.5 power) with length of strings
• … seeking optimal sub-alignments:
• Polynomial (~2.5 power) with length of strings
• … across an exponentially increasing set of
(potentially corrupt) data
• A whole new set of problems.
The real problem?
• The problem is not response time on any single
step.
• The problems are
– Data management
– Throughput and updating results
– Biological Relevance
• We don’t need a faster alignment algorithm, we
need a better homology detector.
Homology Search (ideal)
• Query:
• The thing about which you want information.
• Target:
• Any data at all, preferably all of it at once
• Results:
• Continually updated as new information is published, plus
exhaustive cross references.
• Clear distinction between lab verified and automatic annotations
• “Clickable” is good.
BLAST
• Basic Local Alignment Search Tool
• Focus on local alignments
• important similarities are often confined
to small regions within larger sequences.
• BLAST is a heuristic algorithm:
• Finds exact matches quickly (linear time)
BLAST is the single most popular homology search
program (as of 2004)
BLAST Search
BLAST Finds sequences that are “similar” to a query.
Sequences producing significant alignments: (bits) e-Value
gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC ... 993 0.0
gi|3688210|emb|AJ010861.1|BTAJ10861 Bos taurus MHC class I ... 961 0.0
gi|2864714|dbj|AB008598.1|AB008598 Bos taurus mRNA for MHC ... 882 0.0
gi|2864712|dbj|AB008597.1|AB008597 Bos taurus mRNA for MHC ... 827 0.0
gi|3688212|emb|AJ010862.1|BTAJ10862 Bos taurus MHC class I ... 803 0.0
gi|2864815|dbj|AB008649.1|AB008649 Bos taurus mRNA for MHC ... 783 0.0
…
gi|4106072|gb|AF055348.1|AF055348 Diceros bicornis minor cl... 549 e-154
…
gi|18699296|gb|AF464053.1| Sus scrofa MHC class I antigen (... 468 e-129
gi|188474|gb|M84694.1|HUMMHHLAB4 Human MHC class I HLA-B*40... 462 e-127
…
The BLAST Heuristic
BLAST Heuristic:
To be eligible for consideration, a sequence pair must contain an ungapped
Maximal Segment Pair (MSP) whose score exceeds some threshold.
Two stage process:
Find High-scoring Segment Pairs (HSPs) (linear time)
Generate Alignments, anchored by those HSPs.
caAACTGCTGaacgttgtcgtgagttctggctgcta--
--AACTGCTGggctctc-----ccgatcggctggcaaa
This throws away the vast majority (99% in a random sample) of sequences in the
target set.
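The first stage can be approximated by indexing fixed-length words of the query and scanning the subject for exact hits. The Python below is a deliberately simplified sketch, not NCBI's implementation: it uses exact words only, the word size of 8 is chosen to fit this toy example, and the two sequences are the ones shown above with the gap characters removed.

```python
from collections import defaultdict
from typing import List, Tuple


def find_seeds(query: str, subject: str, word_size: int = 8) -> List[Tuple[int, int]]:
    """Return (query_offset, subject_offset) for every shared exact word."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    seeds = []
    # One pass over the subject: each word is a single dictionary lookup,
    # which is what makes this stage linear in the subject length.
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], ()):
            seeds.append((i, j))
    return seeds


query = "caAACTGCTGaacgttgtcgtgagttctggctgcta".upper()
subject = "AACTGCTGggctctcccgatcggctggcaaa".upper()
seeds = find_seeds(query, subject)
print(seeds)
```

A subject sequence that produces no seed is discarded without any alignment being computed, which is how the bulk of the target set gets thrown away.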
BLAST Search Spaces
Program   Query Type    Database Type   Frames (query x database)
blastp    Protein       Protein         1x1
blastn    Nucleotide    Nucleotide      1x1
blastx    Nucleotide*   Protein         6x1
tblastn   Protein       Nucleotide*     1x6
tblastx   Nucleotide*   Nucleotide*     6x6
*Translated in all 3 reading frames on both strands (6 frames)
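The factor of six comes from three codon offsets on each of the two strands. The Python below sketches how the six frames are laid out; `six_frames` is a hypothetical helper that only splits the sequence into codons, leaving the actual translation to a codon table.

```python
from typing import Dict, List

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq: str) -> Dict[str, List[str]]:
    """Split a nucleotide sequence into codons for all six reading frames."""
    frames: Dict[str, List[str]] = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            usable = len(s) - offset
            end = offset + usable - usable % 3  # drop a trailing partial codon
            frames[f"{strand}{offset + 1}"] = [
                s[i:i + 3] for i in range(offset, end, 3)]
    return frames


frames = six_frames("ATGGCCTAA")
for name, codons in frames.items():
    print(name, codons)
```

This is why tblastx is by far the most expensive of the five programs: every query frame is compared against every database frame.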
BLAST Scores
• “Score”
$S = S(\text{substitutions}) - S(\text{gaps})$
• “Bit Score”
• Score, normalized for $\lambda$ and $K$, two parameters which should be left alone anyway,
and converted to something looking vaguely information theoretic.
$S_n = (\lambda S - \ln K) / \ln 2$
• “E-Value”
• “Expected number of hits of this score or better, in a target set of size n, with a query of
length m”
$E = m n \, 2^{-S_n}$
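These quantities are easy to compute once lambda and K are known. The Python below is a sketch using the standard Karlin-Altschul form, in which the E-value falls as the bit score rises (E = m n 2^(-bit score)). The lambda and K values are illustrative assumptions only; real values depend on the scoring matrix and gap costs and are reported in BLAST output.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw alignment score into bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, m: float, n: float, lam: float, k: float) -> float:
    """Expected number of chance hits scoring at least raw_score."""
    return m * n * 2 ** (-bit_score(raw_score, lam, k))


LAMBDA, K = 0.267, 0.041     # assumed, illustrative statistical parameters
QUERY_LENGTH = 501           # length of the AB008577 example sequence
DATABASE_SIZE = 2e10         # total bases, as quoted for GenBank earlier

for raw in (50, 100, 200):
    print(raw,
          round(bit_score(raw, LAMBDA, K), 1),
          e_value(raw, QUERY_LENGTH, DATABASE_SIZE, LAMBDA, K))
```

Because n appears in the formula, the same alignment gets a larger E-value every time the database grows, even though its bit score stays the same.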
BLAST Search
• Bits: Large scores are good
• E-value: Small scores are good
Sequences producing significant alignments: (bits) e-Value
gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC ... 993 0.0
gi|3688210|emb|AJ010861.1|BTAJ10861 Bos taurus MHC class I ... 961 0.0
gi|2864714|dbj|AB008598.1|AB008598 Bos taurus mRNA for MHC ... 882 0.0
gi|2864712|dbj|AB008597.1|AB008597 Bos taurus mRNA for MHC ... 827 0.0
gi|3688212|emb|AJ010862.1|BTAJ10862 Bos taurus MHC class I ... 803 0.0
gi|2864815|dbj|AB008649.1|AB008649 Bos taurus mRNA for MHC ... 783 0.0
…
gi|4106072|gb|AF055348.1|AF055348 Diceros bicornis minor cl... 549 e-154
…
gi|18699296|gb|AF464053.1| Sus scrofa MHC class I antigen (... 468 e-129
gi|188474|gb|M84694.1|HUMMHHLAB4 Human MHC class I HLA-B*40... 462 e-127
…
E-Value
2.71828182845904523536028747
• Unstable:
– Change every time the dataset grows.
• E-Values are not probabilities
– Yet people seem to treat them as though they are
• Rules of thumb:
– 10^-30: A good, solid hit. Take it to the lab and verify it.
– 10^-10: Okay. Base some further literature search on this.
– 1: Threshold of random chance
– 10: BLAST default cutoff
“Low Complexity Regions”
• By default, BLAST filters out
regions of “Low Complexity” and
replaces them with “XXXXX” in
the alignments.
• This may or may not be what you
want.
Potential Problems
• Round off errors
• Can fail the ‘diff’ test between 32 and 64 bit architectures
• “Silent” errors
• Check those logfiles.
• Parsing
• Please do not write another BLAST output parser.
• There are too many of them already in the world.
• Seriously.
• I’m not kidding about this one.
• Shadowing:
• Omission of interesting short hits in favor of less interesting but longer hits.
BLAST Implementations
• NCBI BLAST
• NCBI Web Site
• NCBI command line tools
• Washington University BLAST
• (web based & command line)
• TIGR online searches
• Los Alamos National Lab
– MPI-BLAST
• TimeLogic Corporation
• “Tera-BLAST”
• Everyone else in the world…
Is There A Parallel BLAST?
Yes.
Multiple Sequence Alignment
• Given a family of related sequences, construct an
optimal multiple sequence alignment (MSA).
• Based on that MSA, construct models which can be
used to recognize as yet unrecognized members of
the set.
Multiple Sequence Alignments
• Patterns
• Motifs
• Position Specific Scoring Matrixes
• Hidden Markov Models
• Neural Networks
Danger Points
• No longer computing similarity to any single
observed sequence (what would they test in the
lab?)
• “Transitive Catastrophe”
• Statistical Starvation.
Beware Intellectual Inbreeding
• Using known protein families, we compute costs for amino acid
substitutions.
• Using those costs, we search for potential homologies and new (putative)
families.
• Build statistical models based on putative protein families
• Rediscover known families with statistical techniques
• Does this provide independent confirmation?
Example: ClustalW
• Align each sequence to each other sequence
• Select a seed alignment
• Build up a multiple alignment from the pieces
• Works great for close relatives
Conserved Patterns
• Motifs:
– Conserved substrings in multiple alignments / sets of
sequences
• Position Specific Scoring Matrixes.
– Add “at each position in an alignment” to the work of
Dayhoff.
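A position specific scoring matrix can be built from an ungapped block of aligned sequences by counting residues per column. The Python below is a minimal sketch: the four six-base sequences are invented for illustration, the background is assumed uniform, and a pseudocount of 1 keeps unseen residues from scoring minus infinity.

```python
import math
from collections import Counter
from typing import Dict, List


def build_pssm(aligned: List[str], alphabet: str = "ACGT",
               pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-column log2 odds of each residue versus a uniform background."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)
    total = len(aligned) + pseudocount * len(alphabet)
    pssm = []
    for i in range(length):
        counts = Counter(s[i] for s in aligned)
        pssm.append({c: math.log2(((counts[c] + pseudocount) / total) / background)
                     for c in alphabet})
    return pssm


def score_window(pssm: List[Dict[str, float]], window: str) -> float:
    """Score a candidate site of the same length as the matrix."""
    return sum(column[residue] for column, residue in zip(pssm, window))


motif_block = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]  # hypothetical sites
pssm = build_pssm(motif_block)
print(score_window(pssm, "TATAAT"), score_window(pssm, "GGGCCC"))
```

Unlike a single substitution matrix, the score for a residue here depends on where it falls in the alignment, so a T is rewarded in column 1 and penalized in column 2.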
What is HMMer?
• Written by Sean Eddy at Wash U
• Open Source
• 15 separate executables
• Build a statistical model of a multiple sequence alignment
• Search sequence databases with models
• Search model databases with sequences
Search Shrimp for globin
• Build a HMM model from 50 globins
% hmmbuild globin.hmm globins50.msf
• Calibrate the model
% hmmcalibrate globin.hmm
• Search shrimp sequence database with model
% hmmsearch globin.hmm Artemia.fa
• Search model database with shrimp sequences
% hmmpfam globin.hmm Artemia.fa
MSF Format…
DNA_MULTIPLE_ALIGNMENT 1.0
Three anthropoidea
MSF: 50 Type: N Check: 2666 ..
Name: Homo_sapiens Len: 50 Check: 8318 Weight: 1.00
Name: Pan_paniscus Len: 50 Check: 7854 Weight: 1.00
Name: Gorilla_gorilla Len: 50 Check: 7778 Weight: 1.00
//
Homo_sapiens AGUCGAGUC...GCAGAAAC
Pan_paniscus AGUCGCGUCG..GCAGAAAC
Gorilla_gorilla AGUCGCGUCG..GCAGAUAC
Homo_sapiens GCAUGAC.GACCACAUUUU.
Pan_paniscus GCAUGACGGACCACAUCAU.
Gorilla_gorilla GCAUCACGGAC.ACAUCAUC
Homo_sapiens CCUUGCAAAG
Pan_paniscus CCUUGCAAAG
Gorilla_gorilla CCUCGCAGAG
hmm State Diagram
hmm Format
HMMER2.0 [2.2g]
NAME globins50
LENG 148
ALPH Amino
RF no
CS no
MAP yes
COM ../binaries/hmmbuild globin.hmm globins50.msf
COM ../binaries/hmmcalibrate globin.hmm
NSEQ 50
DATE Thu Jul 25 10:51:38 2002
CKSUM 9858
XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4
NULT -4 -8455
NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142
-21 -313 45 531 201 384 -1998 -644
EVD -41.853970 0.212647
HMM A C D E F G H I K L M N
P Q R S T V W Y
m->m m->i m->d i->m i->i d->m d->d b->m m->e
-661 * -1444
1 77 -228 -1302 -1020 -730 -1034 -756 578 -803 -375 82 -791 -1461 -720 -959 364 -94 2204 -1315 -857 9
- -149 -500 233 43 -381 399 106 -626 210 -466 -720 275
394 45 96 359 117 -369 -294 -249
- -39 -5807 -6849 -894 -1115 -701 -1378 -661 *
What else could be
bioinformatics?
• Fold / Structure / Docking / Function predictions on proteins
and bioactive molecules
• Ontology building / literature searches / text mining /
knowledge management
• Image processing to support lab automation / data capture /
experiment steering
• Medical records integration with proteomic / transcript
studies
• Expert systems / AI / Clinical / Lab assistant
• Virtual organizations, distributed databases, ad hoc expert
conversations…
One would expect wet-lab scientists to have a healthy skepticism of any
results, knowing how often experiments fail, and how much bad data has
made it out into the literature, but many seem to have an almost mystical
faith in anything produced by computation.
On the other hand, computational people seem to have an almost mystical
faith in wet-lab verification---expecting experiments to be neat, quick
deterministic tests like "if" statements in code.
- Gordon D. Pusch
What can I do today?
• CS:
– Take biology coursework
– Accept that biology is really, really
complex and difficult.
• Bio:
– Take CS coursework
– Accept that computer engineering /
software development is tricky.
• Administrators:
– Decide to build a “spire, which will be
visible from afar”
• All:
– Attend Journal Clubs, symposia, etc.
– Get a bigger monitor
Thank you

Intro bioinformatics

  • 1. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Genomic Biology and Bioinformatics The BioTeam
  • 2. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net BioTeam™ Inc. • Objective & vendor neutral informatics and ‘bio-IT’ consulting • Composed of scientists who learned to bridge the gap between life science informatics and high performance IT • “iNquiry” bioinformatics cluster solution • Staff Michael Athanas Bill Van Etten Chris Dagdigian Stan Gloss Chris Dwan http://bioteam.net
  • 3. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Goal of this session • Introduce major concepts in genetics, genomics, and bioinformatics. • Provide a minimal vocabulary to enable communication. • Enable communication between the disciplines Please ask questions
  • 4. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Outline • Genetics to Genomics • Data formats & Resources • Sequence Analysis
  • 5. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Goals • Build shared vocabulary, global view • Introduce online and text resources • Build interest Not: • Teaching molecular biology • Teaching bioinformatics
  • 6. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Motivation for this session • Bioinformatics will be the major new application domain for High Performance Computing (HPC) applications over the next 50 years. • Life Scientists will walk into the computing center, wanting to work with you (or you will walk into their lab…) • No need to repeat old mistakes.
  • 7. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net What is Bioinformatics? • http://bioinformatics.org/faq/#definitions – Computational Biology – Systems Biology – Genetics – Biology – *-omics • The application of high performance computing and data handling techniques to life sciences research • A major revenue stream, with lots of hype
  • 8. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net
  • 9. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Genome Sizes (in base pairs) • HIV type 1 (HIV): 9,750 • Escherichia coli (E. coli): 4x10^6 • Saccharomyces cerevisiae (yeast): 10^7 • Oryza sativa (rice): 10^8 • Arabidopsis thaliana (“mouse-ear cress”): 10^8 • Drosophila melanogaster (fruit fly): 1.8x10^8 • Bos taurus (cow): 3x10^9 • Homo sapiens (human): 3x10^9 • Zea mays (corn): 5x10^9 • Pinus resinosa (pine): 7x10^10 • Amoeba dubia (amoeba): 6.7x10^11
  • 10. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net In context (Jan, 2004) • Complete genomes: ~800 • 19 eukaryotic • 16 archaea • 64 bacteria • The rest: Viruses • Eukaryotes with at least one sequence in GenBank: • Between 50,000 and 100,000 • Distinct Species • 1.4x10^6 uniquely named species • ~10^7 distinct species
  • 11. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Genome Sizes (in base pairs)
  • 12. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net What else could be bioinformatics? • Fold / Structure / Docking / Function predictions on proteins and bioactive molecules • Ontology building / literature searches / text mining / knowledge management • Image processing to support lab automation / data capture / experiment steering • Medical records integration with proteomic / transcript studies • Expert systems / AI / Clinical / Lab assistant • Virtual organizations, distributed databases, ad hoc expert conversations…
  • 13. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net
  • 14. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Suffixes • “ology”: • Biology, Physiology, Embryology, Terminology • Homology? Homo = same; logy = origin • “ics”: • Physics, Linguistics, Statistics, Bioinformatics • “ome”: • Proteome, Genome, Transcriptome, • Chromosome? Chromo = color; soma = body; • “ome-ics”: • Proteomics, genomics • Economics?
  • 15. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Topics in Genomics • The Central Dogma • Levels of structure and interaction • The Chromosome Model • DNA Sequencing • Genome Assembly • Transcripts and Expression Levels • Protein Folding • Protein Interaction
  • 16. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net What I want you to remember • Genotype vs. Phenotype • The Chromosome Model • The Central Dogma • Levels of Structure (primary -> quaternary) • Homology is boolean • It’s more complicated than they will admit (at first) • http://www.bioinformatics.org • http://www.ncbi.nih.gov Bioinformatics is Biology
  • 17. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Real Question (July 15, 2002) “We have 10,000 BAC end reads from an organism with massive synteny to a model organism. We want to map markers from the model onto the putative homologs in the BAC clones so that we can do directed sequencing.”
  • 18. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Example Question We have 10,000 BAC end reads from an organism with massive synteny to a model organism. We want to map markers in the model onto the putative homologs in the BAC clones so that we can do directed sequencing. • What is a BAC end read? How does it differ from a BAC clone? • What is a Homolog? Given that, what is a “putative” one? • What is “Synteny?” Is it different from homology? • What is a model organism? • What are “markers?” How can I best help this person?
  • 19. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Real Question (May 30, 2002) “Tell me all the kinases which have a valine or an arginine within 2 angstroms of the active site.” • What is a kinase? • What are valine and arginine? • What is an active site?
  • 20. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Why Put The Biology First? “Bioinformatics is full of pitfalls for those who look for patterns or make predictions without a thorough understanding of where biological data comes from and what it means” Nevin Young PhD Professor, University of Minnesota
  • 21. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net A New Way of Thinking • "The new paradigm, now emerging, is that all genes will be known (in the sense of being resident in databases available electronically), and that the starting point of a biological investigation will be theoretical.” - Walter Gilbert, 1993 speculating on the nature of biology in the "post-genome era"
  • 22. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Genetics to Genomics • 1600’s: Europe emerges from the dark ages • 1822 - 1884: Gregor Mendel • 1920’s: Genetic Mapping (Morgan) • 1952: DNA is Genetic Material (Hershey) • 1953: DNA Helix (W & C, Franklin) • 1966- Genetic Code (Nirenberg, Khorana) • 1977- DNA Sequenced (Sanger) • 1988- Human Genome Project Started • 2001- Human Genome Draft Finished
  • 23. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Selective Breeding
  • 24. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Francesco Redi: 1626-1697 • Prevailing Theory “Spontaneous Generation” – Meat makes maggots – Straw makes mice • Experiment: – Meat in two jars, one open one sealed. – Observe flies -> eggs -> maggots -> flies – nothing happens to the closed jar meat • Inference: Flies make flies. • Confirmed by Pasteur in mid 1800’s
  • 25. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Science Marches On! • 1651 - William Harvey • Theory: “Ex Ovo Omnia” From the egg, everything! (No evidence whatsoever) • 1827 - Karl Ernst von Baer • First mammalian egg observed under a microscope. (dog) • 1868 - Friedrich Miescher • DNA (“Nuclein”) first observed. (Surgical bandages from soldiers) • 1875 - Oscar Hertwig • Observed that fertilization in both animals and plants consists of the physical union of the two nuclei contributed by the male and female parents. (Sea Urchin) • 1882: Walther Flemming • Observed chromosomes by staining cells at Meiosis (Salamander)
  • 26. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Gregor Mendel (1822-1884) • Monk, Interested in math & gardening • Selectively bred pea plants – 28,000 plants over 7 years – 7 distinct phenotypic traits. • Published: 1866 • First Cited: 1900
  • 27. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Why did Mendel succeed? • Studied one characteristic at a time: – Pea shape – Internal color – Seed-coat and flower color – pod shape – pod color – flower position – plant height • Kept pedigrees and made several generations of crosses • Kept track of numbers of progeny from each cross. Mendel was really, really lucky.
  • 28. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Genotype vs. Phenotype • Genotype: – Properties (not necessarily observable) that can be passed on to offspring – DNA code and other genetic properties • Phenotype: – Observable traits of the organism – Things we can see Farmers have known this for a long time
  • 29. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Mendelian Genetics • Genetic “factors” (genes) determine phenotypic traits. • Each organism has two instances (alleles) of each gene. • Independent assortment: One copy from each parent (selected at random) is passed on to each progeny.
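The rule on slide 29 can be illustrated for a single gene with a small enumeration. This is a minimal sketch, not part of the original slides: the function name `cross` and the two-letter genotype strings are assumptions made for illustration, and every allele pairing is assumed to be equally likely.

```python
from collections import Counter
from itertools import product


def cross(parent_a: str, parent_b: str) -> Counter:
    """Count offspring genotypes for a single-gene cross.

    Each parent is a two-letter string of alleles, e.g. "Aa".
    One allele is drawn from each parent; all pairings are equally likely.
    """
    offspring = Counter()
    for allele_a, allele_b in product(parent_a, parent_b):
        # Sort so that "aA" and "Aa" count as the same genotype.
        genotype = "".join(sorted((allele_a, allele_b)))
        offspring[genotype] += 1
    return offspring


# Hypothetical example: crossing two heterozygous ("Aa") parents.
counts = cross("Aa", "Aa")
print(counts)
```

For two heterozygous parents the four pairings give the familiar 1 : 2 : 1 genotype ratio.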
  • 30. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Cell Division • Mitosis: • “Ordinary” cell division • Start with 1 diploid cell • End with 2 diploid cells • No crossing over (or, if so, it doesn’t matter) • Meiosis: • “Gametogenesis” • Start with 1 diploid cell • End with 4 haploid gamete cells • Crossing over occurs (mechanism for independent assortment)
  • 31. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net How was Mendel Lucky? Mendel was lucky because: • Peas are diploid • The traits he studied were all far apart on the chromosomes • He didn’t use a self fertilizing (or otherwise freakish) plant Mendel was unlucky because: • Despite being mostly correct, his paper was rejected by his journal of choice • He died before anyone discovered and cited his results • People now think that he must have cleaned his data.
  • 32. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net “The Chromosome Model” With this model, we can look at the entire range of molecular biology, from chromosomes to base pairs. This is not a mechanism
  • 33. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Chromosomes • Chromo = color • Soma = body • Chromosomes: – Colored (when stained) bodies that appear in the cell at mitosis and meiosis – Appear in pairs, except in gamete cells (sperm and ova), where they are single. – A good candidate for the location of genes
  • 34. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Science Marches On! • 1902: Walter Sutton – Evidence that Mendel’s genetic factors exist on chromosomes (grasshoppers) Metaphase Spread Karyotype
  • 35. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Number of (different) Chromosomes Chimpanzee 48 Cabbage 18 Camel 70 Chicken 78 Cat 34 Dog 78 Human 46 Corn 20 Alligator 32
  • 36. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Chromosome Copies: “Ploidy” “Number of copies of each chromosome” • 2 = Diploid: – Humans (and the majority of other eukaryotes) • 4 = Tetraploid: – Pine Trees • 6 = Hexaploid: – ?? • 8 = Octoploid: – Starfish
  • 37. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Thomas Morgan (1866-1945) • “The Fly Room” – Breeding experiments on Drosophila Melanogaster (Columbia University) • Alfred Sturtevant: – First Chromosome Map • Calvin Bridges: – Chromosome theory of Heredity • Hermann Muller: – Mutations can be induced by X-ray irradiation
  • 38. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Why Model Organisms? • Fruitflies: • Only eight chromosomes. • Reproduce very quickly, with lots of offspring. • Tiny, so they don't take up a lot of room in the lab. • They don't need a whole lot of food to survive. • More Recently: • Small genome • Easily transformed • Numerous mutants • Well funded research community
  • 39. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Some modern models • Drosophila Melanogaster • Mus Musculus • Anopheles Gambiae • Arabidopsis Thaliana • Medicago Truncatula • Oryza Sativa • Glycine Max • Zea Mays
  • 40. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Prokaryotes vs. Eukaryotes • Viruses: (10^2 genes, 10^4 base pairs) • Prokaryote: (10^3 genes, 10^6 base pairs) • No Nucleus (Mostly bacteria) • No Introns (genes read continuously) • One circular chromosome • Genes clumped together in “operons” • Much simpler genetics. Also much harder to see. • Eukaryote: (10^4 genes, 10^9 base pairs) • Nucleated • Introns (Genes have untranslated “stuff” stuck in them) • Many, linear chromosomes • Genes spread out all over the place • Multi-cellular and therefore more interesting.
  • 41. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Chromosome Mapping y^3 – 12y^2 + 2y + 4 = 0 Alfred Sturtevant was an undergraduate working in Morgan’s lab who (the story goes) set aside his algebra homework one night to create the first genetic map.
  • 42. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Crossing Over
  • 43. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Chromosome Mapping • Linked Genes: – Recombine less frequently than expected by Mendel’s law of independent assortment – Frequency of recombination ∝ distance – Sturtevant called the unit of distance “map units” – Frequently referred to as “centiMorgans” after Dr. Morgan
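As a rough illustration of the mapping idea on slide 43, recombination frequency can be converted to map units. The sketch below assumes the usual convention that one map unit (centiMorgan) corresponds to 1% recombinant offspring, which only holds approximately and for short distances; the function name and the example counts are hypothetical.

```python
def map_distance_cm(recombinant: int, total: int) -> float:
    """Estimate genetic distance in centiMorgans.

    Assumes 1 cM = 1% recombinant offspring. This linear approximation
    breaks down for distant genes, where double crossovers go unseen.
    """
    if total <= 0:
        raise ValueError("total offspring must be positive")
    if not 0 <= recombinant <= total:
        raise ValueError("recombinant count must be between 0 and total")
    return 100.0 * recombinant / total


# Hypothetical cross: 170 recombinant offspring out of 1,000 scored.
distance = map_distance_cm(170, 1000)
print(f"{distance} cM")
```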
  • 44. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Crossing Over
  • 45. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net A Genetic Map of Drosophila Note that we’re still not looking at DNA sequences.
  • 46. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net DNA is the Genetic Material 1943: Oswald Avery et al. sacrifice mice to demonstrate that DNA could be the material for genes (to one part in 6x10^8). 1952: Alfred Hershey and Martha Chase use viruses to prove it. “Perhaps we will be able to grind genes in a mortar and cook them in a beaker after all.” -Hermann Muller
  • 47. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net “At the time it was believed that DNA was a stupid substance. A tetranucleotide which couldn’t do anything specific.” -Max Delbruck
  • 48. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Nobel Milestones • 1953 - 3D Structure of DNA – Watson & Crick - model – Wilkins & Franklin -x-ray structure – Nobel in 1962
  • 49. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net 1953: Watson & Crick Structure • Nucleotides – ‘A’ Adenine – ‘G’ Guanine – ‘C’ Cytosine – ‘T’ Thymine “It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.” Watson & Crick, 1953
  • 50. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Deoxyribonucleic Acid Chromosomes are long chains of nucleotides in complementary strands… ...AAACTGGAGCTCACCGCGGTGGCGGC... ...GGGTCAAGATCTGTTATAACAATAAT... Complementary single strands have strong affinity for each other: G pairs with C, A pairs with T.
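Strand complementarity is easy to express in code. The following is a minimal sketch using standard Watson-Crick pairing (A with T, G with C); the function names are chosen here for illustration, and the input is the first ten bases of the example strand on slide 50.

```python
# Translation table for Watson-Crick pairing: A<->T, G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the opposite strand as it is conventionally read.

    The two strands run in opposite directions, so the complement
    is reversed.
    """
    return complement(strand)[::-1]


print(complement("AAACTGGAGC"))          # TTTGACCTCG
print(reverse_complement("AAACTGGAGC"))  # GCTCCAGTTT
```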
  • 51. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net “The Chromosome Model” With this model, we can look at the entire range of molecular biology, from chromosomes to base pairs.
  • 52. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Nobel Milestones • 1959 – 3D Structure of a Protein – Perutz & Kendrew – structure of myoglobin & hemoglobin – Nobel in 1962
  • 53. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Nobel Milestones • 1970’s – Nucleic Acid Chemistry – Paul Berg – recombinant DNA – Gilbert & Sanger – sequencing – Nobel in 1980
  • 54. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequencing • Chain-termination DNA sequencing published by Sanger, 1977 • Generate all possible subsequences from a fixed 5’ end (primer) • Sort them by weight • Read terminal nucleotide
  • 55. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sanger Sequencing …AGTCCTG …AGTCCT …AGTCC …AGTC …AGT …AG …A G A T C •DNA of all possible lengths from a known starting point •Each strand ends with a radioactive “dideoxy” nucleotide which terminates the chain •The strands are “weighed” using gel electrophoresis
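The read-out logic of slide 55 can be mimicked in a few lines. This is only a toy model under stated assumptions: every possible chain-terminated fragment is present exactly once, fragment length stands in for the position on the gel, and the function names are hypothetical.

```python
import random


def chain_terminated_fragments(template: str) -> list[str]:
    """Return every prefix of the template, one per termination point."""
    return [template[:n] for n in range(1, len(template) + 1)]


def read_sequence(fragments: list[str]) -> str:
    """Recover the sequence from an unordered pool of fragments.

    Sorting by length plays the role of gel electrophoresis; the last
    base of each fragment is the labelled terminating nucleotide.
    """
    ordered = sorted(fragments, key=len)
    return "".join(fragment[-1] for fragment in ordered)


pool = chain_terminated_fragments("AGTCCTG")
random.shuffle(pool)  # the reaction tube holds the fragments in no order
print(read_sequence(pool))  # AGTCCTG
```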
  • 56. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Modern Sequencing • Accomplished in a single capillary tube • Results read via a laser spectrometer • Accurate to ~700bp • Completely automated (~$0.04 / bp in 2003)
  • 57. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Data, Errors • Error rates for a single read = 0.002 • One error per read sequence, on average • Types of error: • Rare - Misreads • Common - Deletions / double-reads • Insertion of sequence from the vector • Contamination with human or E. Coli DNA • Quality tapers off at the end of a read
  • 58. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Nucleotide Ambiguity Codes A = Adenine G = Guanine T = Thymine C = Cytosine R = A + G Y = C + T K = G + T M = A + C S = C + G W = A + T V = A + C + G B = C + G + T H = A + C + T D = A + G + T N = A + G + T + C I = hypoxanthine !(i/[GATCsn]+/)
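One practical use of the ambiguity codes on slide 58 is expanding a pattern into a regular expression. The sketch below encodes the table from the slide (omitting `I`, hypoxanthine, which is not a plain set of the four bases); the names `AMBIGUITY` and `to_regex` are assumptions for illustration.

```python
import re

# Ambiguity codes from the slide, mapped to the bases they stand for.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "V": "ACG", "B": "CGT", "H": "ACT", "D": "AGT", "N": "ACGT",
}


def to_regex(pattern: str) -> re.Pattern:
    """Compile a pattern written with ambiguity codes into a regex."""
    parts = []
    for code in pattern.upper():
        bases = AMBIGUITY[code]  # raises KeyError on an unknown code
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))


site = to_regex("GAYC")
print(site.pattern)                    # GA[CT]C
print(bool(site.fullmatch("GATC")))    # True
print(bool(site.fullmatch("GAAC")))    # False
```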
  • 59. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net “The Chromosome Model” With this model, we can look at the entire range of molecular biology, from chromosomes to base pairs.
  • 60. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Restriction Enzymes • Cut DNA at a specific substring (different for each restriction enzyme) …GGCTAGATTCCCTAGTTCGCTAATCGCT… |||||||||||||||||||||||||||| …CCGATCTAAGGGATCAAGCGATTAGCGA… Cut with “CTAGT” Restriction Enzyme …GGCTAGATTCCCTAGA TCGCTAATCGCT… ||||||||||| |||||||||||| …CCGATCTAAGG GATCTAGCGATTAGCGA… Sticky Ends
  • 61. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Restriction Enzymes • “Cut” DNA only at a substring specific to the restriction enzyme. • Statistically, these substrings will occur several times along the length of a chromosome: Chromosome Cut Sites
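Finding cut sites is a substring search. The sketch below is a single-strand simplification that ignores sticky ends; the sequence and the “CTAGT” site come from the restriction enzyme slides, while the function names and the `cut_offset` parameter (where within the site the cut falls) are hypothetical.

```python
def find_sites(sequence: str, site: str) -> list[int]:
    """Return the start index of every occurrence of the recognition site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def digest(sequence: str, site: str, cut_offset: int) -> list[str]:
    """Split one strand at each site, cutting cut_offset bases into it."""
    fragments = []
    previous = 0
    for position in find_sites(sequence, site):
        cut = position + cut_offset
        fragments.append(sequence[previous:cut])
        previous = cut
    fragments.append(sequence[previous:])
    return fragments


strand = "GGCTAGATTCCCTAGTTCGCTAATCGCT"
print(find_sites(strand, "CTAGT"))   # [11]
print(digest(strand, "CTAGT", 5))    # ['GGCTAGATTCCCTAGT', 'TCGCTAATCGCT']
```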
  • 62. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Vectors • Circular pieces of DNA with a cut site • Used to capture pieces of DNA Insertion site …GGCTAGATTCCCTAGA TCGCTAATCGCT… ||||||||||| |||||||||||| …CCGATCTAAGG GATCTAGCGATTAGCGA… Sticky Ends
  • 63. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Modern vectors Many possible Insertion sites Gene coding for a brightly colored protein so we can visually distinguish vectors with inserts from those without Gene conveying resistance to ampicillin
  • 64. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Making Insert Libraries • Separate out DNA from target organism • Use PCR to make lots of copies of the DNA • Cut with restriction enzymes, with vectors present in solution • Place vectors into E. coli cells • Spread vectorized E. coli onto agar plates • Let grow overnight on medium with ampicillin • Transfer only non-blue colonies into multi well plates (96 or 384). • Sequence all the wells. • What do you get after all this fun? Thousands of “clone libraries” in a freezer somewhere
  • 65. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sizes of Insert Libraries • Phage Library: • 5 - 3,000bp • Bacterial Artificial Chromosome (BAC): • 80,000 - 100,000 bp • Yeast Artificial Chromosome (YAC): • 150,000 - 200,000 bp
  • 66. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Restriction Enzymes Restriction Fragments • By controlling the relative amounts of DNA and restriction enzyme, we can produce a large set of smaller chromosome fragments
  • 67. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net BAC End Sequences Restriction Fragments • It is “easy” to read the 700bp at each end of the insert libraries
  • 68. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net How Many Fragments? • For a 5 letter (5-mer) restriction enzyme, odds of randomly hitting the target sequence are approximately: (1/4)^5 = 1/1024 ≈ 10^-3 • If a genome of interest is about 3x10^9 bp this gives us approximately: 3x10^6 segments • Using 3 or 4 unrealistic assumptions….
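The estimate on slide 68 can be checked directly. This sketch makes the same unrealistic assumptions the slide alludes to, stated here explicitly: all four bases are equally frequent, positions are independent, and the number of fragments is taken to equal the number of sites. The function name is hypothetical.

```python
def expected_fragments(genome_length: float, site_length: int) -> float:
    """Expected number of cut sites (and so, roughly, fragments).

    Assumes each base is A, C, G or T with probability 1/4,
    independently of its neighbours.
    """
    p_site = 0.25 ** site_length  # chance a given position starts a site
    return genome_length * p_site


print(0.25 ** 5)                    # 0.0009765625, i.e. 1/1024
print(expected_fragments(3e9, 5))   # 2929687.5, roughly 3x10^6
```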
  • 69. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Genome Sequencing: BAC Tiling • Directed BAC Sequencing – Read all BAC Ends & Fingerprints – Create the minimal tiling path to cover each chromosome – Sequence each BAC using smaller insert libraries (but the same basic idea) – Close Gaps (primer walking)
  • 70. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Directed BAC Sequencing Minimum Tiling Path
  • 71. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Shotgun Sequencing • Use inserts of approximately 1,000bp • No pre-processing or ordering, use computational techniques to assemble larger and larger fragments • Entirely automated • Works a lot better if someone else is doing BAC sequencing in the public domain
  • 72. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Finishing a Genome • Sequence ought to be derived from a mixture of anonymous individuals • Hard to finish regions: – Telomere – Centromere – Highly variable regions • 10x coverage, 99% assembly • Standards vary by community.
  • 73. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net We have a genome, now what? • Where are the genes? • How are genes controlled / activated? • Can we add to / subtract from the genome? • Why is there all that extra “junk” in there? • What genes are common between organisms?
  • 74. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Topics in Genomics • The Central Dogma • Levels of structure and interaction • The Chromosome Model • DNA Sequencing • Genome Assembly • Transcripts and Gene Expression • Protein Folding • Protein Interaction
  • 75. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Central Dogma DNA •Four Bases: •GATC •Double Stranded •G pairs with C •A pairs with T •Packaged in Chromosomes RNA •T->U •Single Stranded •Mechanism for differential gene expression Amino Acid Chains •20 amino acids •“Genetic Code” translates 3 RNA bases to 1 amino acid Transcription Translation All disciplines should have the guts to admit to having a “central dogma”
  • 76. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Levels of Structure • Primary Sequence • Secondary Local properties • Hydrophobic / hydrophilic regions. • α-Helices and β-sheets • Tertiary 3-d structure • Quaternary Interaction • Protein-protein interactions • post transcriptional modification • Enzymatic action • $$$$
  • 77. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net What is a “gene?” • “The fundamental unit of genetic inheritance” • “One gene, one transcript” • One gene, one splice variant • “One gene, one protein” • “One gene, one heritable trait”
  • 78. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Nobel Milestones • 1960’s – Genetic Code • Holley, Khorana and Nirenberg • Rosetta Stone of Life • Nobel in 1968
  • 79. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net The Genetic Code
  • 80. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Gamow and the Genetic Code
  • 81. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Transcription & Translation …GATC… …CTAG…DNA …GAUC…mRNA Amino Acid Chain Transcription Translation (in one of six possible “Reading Frames”) …RIDVLKGEKALKASGLVP… Protein Folding Anthrax Toxin Delivery Factor
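The transcription and translation steps on this slide can be sketched in code. The example below is illustrative only: it uses the standard genetic code, ignores splicing and start/stop signals, and the function names are made up. The six reading frames are the three offsets on the given strand plus the three offsets on its reverse complement.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def transcribe(dna: str) -> str:
    """DNA coding strand -> mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(dna: str, frame: int = 0) -> str:
    """Translate DNA codon by codon; '*' is a stop, 'X' an unknown codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna: str) -> dict:
    """Translations in all six reading frames, keyed '+1'..'+3', '-1'..'-3'."""
    frames = {}
    rc = reverse_complement(dna)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(rc, offset)
    return frames


if __name__ == "__main__":
    print(transcribe("GATC"))
    # First bases of the Bos taurus record (AB008577) shown later in this deck
    print(translate("AGGTATTTCCACACCGCCGTGTCTCGGCCCGGC"))
```

Translating the first 33 bases of the AB008577 record in frame +1 gives the same opening residues as the `/translation` qualifier in its GenBank entry.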
  • 82. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Eukaryotic genes contain Introns
  • 83. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net But wait, there’s more “Promoter” TATA “Start” ATG “Stop” TAA mRNA Splicing RNA DNA Introns (non coding regions) are removed AAA(A100+) Poly-A tail is attached Open Reading Frame (ORF) Six reading frames are possible
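A naive open reading frame scan follows directly from the "Start" ATG and "Stop" signals on this slide. The sketch below is a simplification under stated assumptions: it scans only the forward strand (run it on the reverse complement for the other three frames), treats TAA, TAG and TGA as stops, and knows nothing about promoters, introns or splicing. The name `find_orfs` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 1) -> list:
    """Return (start, end) index pairs of ATG...stop stretches.

    Scans the three forward reading frames. `end` is exclusive and
    includes the stop codon. `min_codons` counts codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    seq = "CCATGAAATAAGG"
    for begin, end in find_orfs(seq):
        print(begin, end, seq[begin:end])
```

Real gene finders for eukaryotes must also model introns, which is why a plain ORF scan like this works far better on bacterial sequence or on cDNA.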
  • 84. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Expression Level • “What protein is being made / which gene is being turned on when <your question here>?” • Can approximate this with mRNA levels. – Translation does not occur at a fixed rate – Proteins degrade at radically different rates – Some mRNA is never translated
  • 85. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Expressed Sequence Tags 1. Select organism to study 2. Chop up organism into “libraries” representing interesting tissues, developmental stages, or experimental conditions. 3. Extract and sequence as many cDNAs as possible from each library. 4. Compare sequences to determine: • Tissue specific gene expression • Hypothetical functions for proteins • Expression levels (relative concentration in cytoplasm)
  • 86. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Expressed Sequence Tags Cell 2. Use Reverse Transcriptase (poly-T primer) to create cDNA AAAAA(A100+) 1. Use Enzymes to digest DNA & Proteins, leaving mRNA TTTTT(T) 4. Sequence (via a complex procedure omitted here for the sake of brevity) the cDNA. 3. Capture the cDNA strand in vector and incorporate into E. coli cells to replicate.
  • 87. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net EST Data Reads of the same cDNA (product of the same gene) produce an assortment of sequences sharing the Poly-A 3’, and extending a random distance toward the 5’ end. Issues: • Sequence contamination with E. coli, or vector • Spurious groupings of cDNA from different genes containing similar regions • Omission of genes due to low concentration or lack of expression (solve with additional libraries)
  • 88. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net ESTs are Popular • Human: 4x10^9 sequences • Mouse: 2x10^9 sequences • Medicago truncatula: 1.6x10^6 • Read only the genes which are being expressed • Get crude information about expression levels based on frequency of a certain sequence. • If a genome sequence is available, can locate genes on chromosomes using similarity search
  • 89. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Southern Blot • Affix “target” single stranded sequence to a nylon membrane • Label “probe” single stranded sequences (mRNA from cells) with a fluorescent dye • Wash probe over target • Similar sequences will hybridize (stick together) • Check for fluorescence Target Probe Fluorescent Label
  • 90. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Micro / Macroarrays • Stick (hybridize) single stranded DNA to some surface (glass slide or nylon membrane) • Attach fluorescent markers to the single stranded “probe” control sample • Attach a different frequency of fluorescent marker to experimentally stressed probe sequences • Wash probes over targets. (like will stick to like) • Illuminate with laser and record differential frequency response
  • 91. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Microarray Data
  • 93. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Gene Chips (2003) • 20bp sequences built using photolithography • Sequence must be known in advance • $200-$500 per “chip” from Affymetrix (and others) • Tools for data analysis also available for $$
  • 94. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Microarrays vs Gene Chips • Microarrays • Cheap to create • No need to know sequences ahead of time (just use sample that is already in the freezer) • Gene Chips • Initially expensive to create • All target sequences already known • “The mouse chip.” “The human chip”
  • 95. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Time Course Experiments • At t=0, 5, 10, … from start of condition x • What genes are up and down regulated • What gene clusters seem to move together?
  • 96. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Quality of Microarray data • Spot location • Spot size • Differential Hybridization • Errors in “swishing” of the probes • In general, only differences of 1σ and above are significant.
  • 97. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Aspects of Protein Structure 1 XMNFSGKYQV QSQENFEPFM KAMGLPEDLI QKGKDIKGVS EIVHEGKKVK 51 LTITYGSKVI HNEFTLGEEX ELETMTGEKV KAVVKMEGDN KMVTTFKGIK 101 SVTEFNGDTI TNTMTLGDIV YKRVSKRI
  • 98. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Amino Acid Codes Alanine Ala A Arginine Arg R Asparagine Asn N Aspartic Acid Asp D Cysteine Cys C Glutamic Acid Glu E Glutamine Gln Q Glycine Gly G Histidine His H Isoleucine Ile I Leucine Leu L Lysine Lys K Methionine Met M Phenylalanine Phe F Proline Pro P Serine Ser S Threonine Thr T Tryptophan Trp W Tyrosine Tyr Y Valine Val V Glutamic Acid or Glutamine: Z Any / Unknown Amino Acid: X
  • 99. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net A bit more about Alanine Molecular Structure CH3-CH(NH2)-COOH Molecular formula C3H7NO2 Molecular weight: 89.09 Isoelectric point (pH): 6.00 CAS Registry Number: 56-41-7
  • 100. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Protein Structure is Difficult • There is, presently, no high throughput solution to determining protein structure • Crystal structure with X-Ray Crystallography • MALDI-TOF • Computational Techniques (not mature beyond secondary structure)
  • 101. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Dangers of Protein Structures • If DNA sequences are cartoons… • Protein structures are even less than that. – Crystalline form (non biologically active) – Low temperature – No interactions with other molecules
  • 102. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Massively parallel biology • Sequencing: – Large centers produce multiple megabases per day, run 24 by 7 • Expression: – Microarrays: 100,000 “spots” in parallel. – 1 µm diameter – Read with scanning laser – Petabytes of image data soon
  • 104. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Why the Explosion? http://www.sanger.ac.uk/Info/IT/
  • 105. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net More… • Proteomics • Metabolomics • Single Nucleotide Polymorphism (SNP) • … • Biochemical pathway analysis • Protein - protein interaction • … • “Systems Biology”
  • 106. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Based Bioinformatics
  • 107. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net “The Chromosome Model” With this model, we can look at the entire range of molecular biology, from chromosomes to base pairs. This is not a mechanism
  • 108. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Levels of Structure (review) • Primary Sequence • Secondary Local properties • Hydrophobic / hydrophilic regions. • α-Helices and β-sheets • Tertiary 3-d structure • Quaternary Interaction • Protein-protein interactions • post transcriptional modification • Enzymatic action
  • 109. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Homology is evolutionary relation • Homolog: – Related by descent. – This is a boolean property It is either true or false • Can Occur Via: – Duplication within a genome – Separation by descent.
  • 110. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Other Terms • Synteny: – Genes share ordering between species • Ortholog: Related by speciation • Paralog: Related by duplication • Wet lab: Bubbling vats of goo • Dry lab: Whirring fans
  • 111. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Comparative Genomics
  • 112. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Phylogenetic Reconstruction
  • 113. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Chromosome scale rearrangements Remarkable similarity between mouse and human chromosomes. But what does this picture mean? And how would we go about computing it? •Traditional gene maps? •Markers? •Sequence similarity? •A combination of the wet and dry lab?
  • 114. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Genetic Database Collaboration • NCBI – National Center for Biotechnology Information – GenBank – http://www.ncbi.nlm.nih.gov • EBI – European Bioinformatics Institute – EMBL - European Molecular Biology Laboratory – http://www.ebi.ac.uk • CIB – Center for Information Biology – DDBJ - DNA Data Bank of Japan – http://www.ddbj.nig.ac.jp
  • 115. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net International Collaboration NCBI CIB EBI Genbank DNA Databank of Japan EMBL Nucleotide Sequence Database Data are synchronized nightly between the three centers
  • 116. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net National Center for Biotechnology Information • The genetic sequence database of the US National Institutes of Health • International Nucleotide Sequence Database Collaboration: – DNA DataBank of Japan (DDBJ) – European Molecular Biology Laboratory (EMBL) – GenBank • 2x10^10 bases in 1.7x10^7 sequences • Release every two months, daily updates http://www.ncbi.nih.gov
  • 117. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Data Sets at NCBI • ‘NT’ • Nucleotide sequence dataset. • Quality standards include 7x read, 1x reverse • ‘NR’ • Non-redundant (cough cough…) • amino acid sequence dataset • ‘EST’ • Expressed Sequence Tag data • Low quality, different sort of data
  • 118. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Transitive Catastrophe • Sequences of low quality are annotated by similarity to other sequences of low quality • This can build a corpus of erroneous data • Which will then be used to generate statistical models and faster algorithms • Which will be used to mis-annotate exponentially increasing volumes of data
  • 119. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net More Sequence Data Sets • Protein Data Bank (PDB): • Amino acid sequences for which a structure has been experimentally determined • SwissProt: • Amino acid sequences with a high level of annotation • Genomes: • All shapes and sizes
  • 120. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Entrez (at NCBI) • PubMed: The biomedical literature (PubMed) • Nucleotide sequence database (Genbank) • Protein sequence database • Structure: three-dimensional macromolecular structures • Genome: complete genome assemblies • PopSet: population study data sets • OMIM: Online Mendelian Inheritance in Man • Taxonomy: organisms in GenBank • Books: online books • ProbeSet: gene expression and microarray datasets • 3D Domains: domains from Entrez Structure • UniSTS: markers and mapping data • SNP: single nucleotide polymorphisms • CDD: conserved domains
  • 121. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Protein Structure Databases • PDB - Protein DataBank – Established in 1971 for protein structures – http://www.pdb.org – Now also includes nucleic acids, carbohydrates
  • 122. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Protein Sequence Databases • PIR - Protein Information Resource – Protein Sequence Database (PIR-PSD) – Established in 1984 – http://pir.georgetown.edu/ Year Amino Acid Residues Sequence Records 1984 526,466 2,676 2001 76,174,552 219,241
  • 123. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Protein Sequence Databases • SWISS-PROT – Established in 1986 – http://www.expasy.org/sprot/ – Try to distinguish themselves by • Annotation • Minimal redundancy • Integration with other databases
  • 124. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net More Data Resources • The Institute for Genomic Research (TIGR) – http://www.tigr.org • European Molecular Biology Laboratory (EMBL) – http://www.embl.org • European Bioinformatics Institute (EBI) – http://www.ebi.ac.uk • SwissProt, TrEMBL: – http://www.expasy.ch
  • 125. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Ensembl • EBI’s integrative genome data toolkit. • A web based tool in which data from various sources are associated with chromosome maps and locations. • http://www.ensembl.org
  • 126. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Distributed Annotation System (DAS) • Client / Server system for publishing annotations to chromosomal data. • http://www.biodas.org • BioMOBY: Web Services genome annotation framework
  • 127. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Protein Structures • SCOP: “Structural Classification of Proteins” – Superfamily – Family – Fold • CASP – Competition for protein structure prediction programs – Results are still lacking.
  • 128. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Data types & Formats
  • 129. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net FASTA Format >gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC class I heavy chain, partial cds, clone MP-5.10m AGGTATTTCCACACCGCCGTGTCTCGGCCCGGCCTCCGGGAGCCCCTCTTTATC ACGTCGGCTACGTGGACGACACGCAGTTCGTGCGGTTCGACAGCGACGCCCGGG ATCCGAGGAAAGAACCACGGCAGCCGTGGATGGAGAAGGAGGGGCCGGAGTATT GGGATCGCGAGACTCAAATCTCCAAGGAAAACGCACTGAAGTACCGAGAGGCCT TAACATCCTGCGCGGCTACTACAACCAGAGCGAGGCCGGGTCTCACACCTATCA GCGGATGTACGGCTGCGACGTGGGGCCGGACGGGCGCCTCCTCAGCGGGTTCAC GCAGTTCGGCTACGACGGCAGAGATTACATCGCCCTGAACGAGGACCTGCGCTC CGGACCGCGGCGGACACGGCGGCTCAGATCACCAAGCGCAAGTGGGAGGCGGCC GGTGAGGCGGAGAGATTCAGGAACTACGTGGAGGGCCGGTGCGTGGAGTGGCTC CGCAGATACCTG
  • 130. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net FASTA Format • Definition line: • Required • starts with ‘>’ • contains no line breaks • Non-printing characters are frowned upon, but don’t break most tools. Ctrl-A is used by some organizations to combine deflines in Unigene sets • Data: • Unlimited nucleotide or amino acid sequence, possibly filled with whitespace and carriage returns. • Capitalization does not matter (unless it does) • FASTA files can (sometimes) be concatenated.
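The rules above are enough to write a small reader. This is a minimal sketch rather than a robust parser: it assumes each definition line starts with '>', strips whitespace inside the sequence data, and folds everything to upper case (which throws away soft-masking, one of the cases where capitalization "does matter"). The name `read_fasta` is made up for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (defline, sequence) pairs from an open FASTA file.

    The defline is returned without its leading '>'. Blank lines and
    whitespace inside the sequence are ignored.
    """
    defline, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        elif defline is not None:
            # drop embedded spaces/tabs and normalise case
            chunks.append("".join(line.split()).upper())
    if defline is not None:
        yield defline, "".join(chunks)


if __name__ == "__main__":
    example = io.StringIO(">seq1 demo record\nACGT\nac gt\n\n>seq2\nTTTT\n")
    for name, seq in read_fasta(example):
        print(name, len(seq), seq)
```

Because records are only delimited by '>' lines, concatenating two files works as long as the first one ends with a newline, which is the "(sometimes)" in the last bullet.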
  • 131. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net GenBank Entry LOCUS AB008577 501 bp mRNA linear MAM 22-JAN-1999 DEFINITION Bos taurus mRNA for MHC class I heavy chain, partial cds, clone MP-5.10m. ACCESSION AB008577 VERSION AB008577.1 GI:4165369 KEYWORDS MHC class I heavy chain. SOURCE Bos taurus (variety:Holstein, isolate:MP-5) cultured T cells cDNA to mRNA, clone:MP-5.10m. ORGANISM Bos taurus Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Cetartiodactyla; Ruminantia; Pecora; Bovoidea; Bovidae; Bovinae; Bos. REFERENCE 1 (bases 1 to 501) AUTHORS Urakawa,T., Kodama,M., Morita,M. and Ikeda,H. TITLE Direct Submission JOURNAL Submitted (02-NOV-1997) Toyohiko Urakawa, STAFF Institute, 2nd Division; 446-1 Ippaizuka, Kamiyokoba, Tsukuba, Ibaraki 305, Japan (E- mail:urakawa@gene.staff.or.jp, Tel:+81-298-38-7757, Fax:+81-298-38-7880)
  • 132. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Fun facts about GenBank • Accession: • Unique ID for this sequence: AB008577 • Version: • Incremented with each update: AB008577.1 • GI: • Old version of Accession • Taxonomy ID: • Link into NCBI’s Taxonomy tree Only original authors can update data
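The identifiers listed here are all packed into the FASTA definition line shown earlier (`gi|4165369|dbj|AB008577.1|AB008577 ...`). The sketch below pulls them apart. It assumes the pipe-delimited `gi|number|database|accession.version|locus` layout of that one example; other deflines use other layouts, and the function and key names are hypothetical.

```python
def parse_ncbi_defline(defline: str) -> dict:
    """Split a 'gi|...|db|ACC.VER|LOCUS description' line into fields.

    Only the gi-style layout is handled; anything else comes back with
    just the free-text description filled in.
    """
    ident, _, description = defline.lstrip(">").partition(" ")
    fields = ident.split("|")
    parsed = {"description": description}
    if len(fields) >= 4 and fields[0] == "gi":
        parsed["gi"] = fields[1]
        parsed["database"] = fields[2]  # e.g. dbj = DDBJ, emb = EMBL, gb = GenBank
        accession, _, version = fields[3].partition(".")
        parsed["accession"] = accession
        parsed["version"] = version
    return parsed


if __name__ == "__main__":
    line = (">gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for "
            "MHC class I heavy chain, partial cds, clone MP-5.10m")
    for key, value in parse_ncbi_defline(line).items():
        print(f"{key}: {value}")
```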
  • 133. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net GenBank Entry FEATURES Location/Qualifiers /organism="Bos taurus" /variety="Holstein" /isolate="MP-5" /db_xref="taxon:9913" /clone="MP-5.10m" /cell_type="cultured T cells" /note="BoLA class I haplotype (A8A14/A6A19); Common E group; RT-PCR amplified clone" CDS <1..>501 /standard_name="MHC class I related gene" /note="particial alpha 1and 2 domains" /codon_start=1 /product="MHC class I heavy chain" /protein_id="BAA37151.1" /db_xref="GI:4165370" /translation="RYFHTAVSRPGLREPLFITVGYVDDTQFVRFDSDARDPRKEPRQ PWMEKEGPEYWDRETQISKENALKYREALNILRGYYNQSEAGSHTYQRMYGCDVGPDG RLLSGFTQFGYDGRDYIALNEDLRSWTAADTAAQITKRKWEAAGEAERFRNYVEGRCV EWLRRYL"
  • 134. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net GenBank Entry BASE COUNT 105 a 148 c 173 g 75 t ORIGIN 1 aggtatttcc acaccgccgt gtctcggccc ggcctccggg agcccctctt tatcaccgtc 61 ggctacgtgg acgacacgca gttcgtgcgg ttcgacagcg acgcccggga tccgaggaaa 121 gaaccacggc agccgtggat ggagaaggag gggccggagt attgggatcg cgagactcaa 181 atctccaagg aaaacgcact gaagtaccga gaggccttga acatcctgcg cggctactac 241 aaccagagcg aggccgggtc tcacacctat cagcggatgt acggctgcga cgtggggccg 301 gacgggcgcc tcctcagcgg gttcacgcag ttcggctacg acggcagaga ttacatcgcc 361 ctgaacgagg acctgcgctc ctggaccgcg gcggacacgg cggctcagat caccaagcgc 421 aagtgggagg cggccggtga ggcggagaga ttcaggaact acgtggaggg ccggtgcgtg 481 gagtggctcc gcagatacct g
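The ORIGIN block interleaves position numbers and spaces with the sequence, so it has to be cleaned before use. The following sketch assumes its input is only the lines after the `ORIGIN` keyword; the helper names are made up. Counting the cleaned bases is a quick sanity check against the record's own BASE COUNT line.

```python
from collections import Counter


def origin_to_sequence(origin_text: str) -> str:
    """Strip position numbers and whitespace from GenBank ORIGIN lines."""
    return "".join(ch for ch in origin_text.lower() if ch.isalpha())


def base_count(sequence: str) -> Counter:
    """Tally each base, comparable to the BASE COUNT line of the record."""
    return Counter(sequence)


if __name__ == "__main__":
    # First and last lines of the entry above, for brevity
    fragment = """
        1 aggtatttcc acaccgccgt gtctcggccc ggcctccggg agcccctctt tatcaccgtc
      481 gagtggctcc gcagatacct g
    """
    seq = origin_to_sequence(fragment)
    print(len(seq), dict(base_count(seq)))
```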
  • 135. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Ways to access data at NCBI • http://www.ncbi.nih.gov • Can use ENTREZ to define fairly unique sets of sequences and download in batch • ftp.ncbi.nih.gov:/blast/db • Download the entire 15GB set of datasets • http://www.bioperl.org • Perl routines for automating small data retrieval jobs.
  • 136. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net NCBI Supported Formats • ASCII GenBank Record • FASTA • ASN.1 • XML
  • 137. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net More file formats • Chromatogram: • Binary output of an automated sequencer • Phd / phred / quality file: • ASCII file combining bases and quality values. • ASN.1: • Binary representation of GenBank entries • C and C++ libraries for accessing ASN.1 are maintained by NCBI
  • 138. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Handling Tasks • Base calling – Chromatogram -> FASTA • Sequence Cleaning – Search for contamination – Vector – host DNA – other common sequencing artifacts. • Contig Assembly • Genomic Assembly
  • 139. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Unigene Sets • Contigging: – In EST projects, cDNA reads which are believed to originate from the same mRNA transcript are associated into contiguous segments. – Sets of these contigged (consensus) sequences are sometimes called “Unigene Sets.” – Programs for doing this include: • phrap • TIGR Assembler • Consed • Arachne
  • 140. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Genomic Assembly • Genomic Assembly: – A time and labor intensive process by which gaps in the genomic sequence are identified, primer pairs are constructed to target those gaps, and additional sequencing is performed. – There is no general solution to this, nor will there be.
  • 141. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Microarray Analysis • Data Management: – GeneSpring and others: Web front end to an annotation database for microarray information • Analysis: – Normalization – Synthetic experiment design
  • 142. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Biochemical Pathway Analysis • Kyoto Encyclopedia of Genes and Genomes • http://www.genome.jp/kegg/
  • 143. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Analysis
  • 144. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Analysis Overview • Properties of individual sequences • Sequence alignment • Alignment based search (BLAST) • Multiple Sequence Alignment • Motifs / etc. • Statistical models / model based search
  • 145. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Amino Acid Properties
  • 146. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Similar Amino Acids Tyrosine (Y) Phenylalanine (F)
  • 147. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Similar Amino Acids Aspartate (D)Glutamate (E)
  • 148. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Examples from EMBOSS • Pepstats • Charge • Compseq • Pepwindow
  • 149. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Comparing Sequences >NXCI_115_B04_F 544 0 544 ABI GTGGTAAAACTGGAGCTCACCGCGGTGGCGGCCGCTCT ANAACTAGTGGATCCCCCGGGCTGCAGGAATTCGGCAC GAGATTTTGACAGACATGAGCTCATATGCAGATGCTTT GCGTGAAGTGTCTGCAGCTCGTGAAGAAGTGCCTGGCC GACGTGGTTATCCTGGGTACATGTATACTGACTTGGCA ACGATTTATGAACGGGCAGGACGTATTGAAGGCCGAAA AGGCTCTATTACTCAGATTCCCATTCTGACCATGCCCA ATGATGATATTACACACCCAATTCCAGATCTAACAGGT TACATCACAGAAGGGCAGATATATATTGACAGGCAACT TCATATCGACAGATATACCCACCAATCAATGTTCTTCC ATCTCTATCACGATTGATGAAGAGTGCTATAGGGGAGG GAATGACTCGACGGGATCATGCTGAAGTTTCAAATCAG CTATAGCAAATTATGCAATTGGAAAGGATGTACAAGCA ATGAAGGCTGTGGTTGGAGAGGAGGCCTTGTCATCAGA GGATCTGCTG >gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC class I heavy chain, partial cds, clone MP-5.10m AGGTATTTCCACACCGCCGTGTCTCGGCCCGGCCTCCGGGAG CCCCTCTTTATCACGTCGGCTACGTGGACGACACGCAGTTCG TGCGGTTCGACAGCGACGCCCGGGATCCGAGGAAAGAACCAC GGCAGCCGTGGATGGAGAAGGAGGGGCCGGAGTATTGGGATC GCGAGACTCAAATCTCCAAGGAAAACGCACTGAAGTACCGAG AGGCCTTAACATCCTGCGCGGCTACTACAACCAGAGCGAGGC CGGGTCTCACACCTATCAGCGGATGTACGGCTGCGACGTGGG GCCGGACGGGCGCCTCCTCAGCGGGTTCACGCAGTTCGGCTA CGACGGCAGAGATTACATCGCCCTGAACGAGGACCTGCGCTC CGGACCGCGGCGGACACGGCGGCTCAGATCACCAAGCGCAAG TGGAGGCGGCCGGTGAGGCGGAGAGATTCAGGAACTACGTGG AGGGCCGGTGCGTGGAGTGGCTCCGCAGATACCTG
  • 150. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Alignment HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL HBB_HUMAN GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL HBA_HUMAN GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL ++ ++++H+ KV + +A ++ +L+ L+++H+ K LGB2_LUPLU NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL GS+ + G + +D L ++ H+ D+ A +AL D ++AH+ F11G11.2 GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE
  • 151. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Alignment, Fact 1 “In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity.” Dan Gusfield. Algorithms on Strings, Trees, and Sequences. 1997. Cambridge University Press. p.212.
  • 152. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Alignment, Fact 2 “Evolutionary and functionally related molecular strings can differ significantly throughout much of the string and yet preserve the same three-dimensional structure(s), or the same two dimensional substructure(s) (motifs, domains), or the same active sites, or the same or related dispersed residues (DNA or amino acid).” Dan Gusfield. Algorithms on Strings, Trees, and Sequences. 1997. Cambridge University Press. p.334
  • 153. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Alignment • Why do sequences appear similar? – common ancestry – common function – chance • Terms – Identity - identical matches – Similarity - common properties – Homolog - common ancestor (related by descent) • Paralog - same species, different copy / function • Ortholog - same function, different species
  • 154. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Doolittle’s Twilight Zone • Point at which two sequences may appear to be related based only on random chance
  • 155. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Dottup Example
  • 156. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Alignment • Aligning two sequences: – Insert a minimum number of gaps into one or both sequences to maximize matches DDLMLSPDDLAQWLTEDPGPSEAPRMSE |||:| | |: :: ||||| |:| DDLLL-PQDVEEFF---EGPSEALRVSG
  • 157. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Alignments • Matches may be identical • Matches may include similar but not identical properties DDLMLSPDDLAQWLTEDPGPSEAPRMSE |||:| | |: :: ||||| |:| DDLLL-PQDVEEFF---EGPSEALRVSG DDLMLSPDDLAQWLTEDPGPSEAPRMSE |||:| | |: :: ||||| |:| DDLLL-PQDVEEFF---EGPSEALRVSG
  • 158. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Evolution of String Comparison • Hamming Distance (1950): The number of locations at which the two (binary) strings of equal length differ. • Levenshtein Distance (1965): The number of single character insertions, deletions, or substitutions (edits) required to transform one sequence into another. • “Substitution Matrices” (Dayhoff, 1978): Use of a Substitution Matrix to encode log likelihoods of substitutions. • “Gapped Alignments” (Many authors, 1980+): Mathematical models for allowing gaps in alignments • “Statistical Models” (Many authors, 1982+): No longer aligning against a specific string, but against the compiled statistics of sets of strings.
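The Levenshtein distance in this list has a compact dynamic programming solution. The sketch below uses unit cost for every insertion, deletion and substitution and keeps only two rows of the table in memory; the function name is illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("GATC", "GAC"))
    print(levenshtein("kitten", "sitting"))
```

Unlike Hamming distance, this handles strings of different lengths, which is the step that makes insertions and deletions (gaps) expressible at all.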
  • 159. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Hamming Distance (1950’s) Count of the differences between two sequences of identical length 53/55 identical ctggagctcaccgcggtggcggccgctcta |||||||||||||||||||||||||||||| ctggagctcaccgcggtggcggccgctcta 49/55 identical gtaaagcccaccgcggtggcggccgctcta | ||| |||||||||||||||||||||| ctggagctcaccgcggtggcggccgctcta
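Hamming distance is the simplest of these measures to compute. A minimal sketch, applied to the 30 bases of the second pair that are visible on this slide (the slide's 49/55 figure refers to the full 55-base sequences, which are not shown in full here):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    top = "gtaaagcccaccgcggtggcggccgctcta"
    bottom = "ctggagctcaccgcggtggcggccgctcta"
    differences = hamming(top, bottom)
    print(f"{len(top) - differences}/{len(top)} identical")
```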
  • 160. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Substitution Matrixes • Margaret Dayhoff (1925-1983): • “Percent Accepted Mutation” (PAM) 1973 • Substitution frequencies from “real” alignments of known homologs, normalized to some percent mutation rate. • 1300 sequences, 72 families, closely related within families • PAM_ij = 10 log10(R_ij) • R_ij = freq(i -> j) / freq(i) • PAM_n = (PAM_1)^n
  • 161. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net PAM 250
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
    A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   0   0   0  -8
    R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2  -1   0  -1  -8
    N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   2   1   0  -8
    D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   3   3  -1  -8
    C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -4  -5  -3  -8
    Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   1   3  -1  -8
    E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   3   3  -1  -8
    G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   0   0  -1  -8
    H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   1   2  -1  -8
    I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -2  -2  -1  -8
    L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -3  -3  -1  -8
    K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   1   0  -1  -8
    M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -2  -2  -1  -8
    F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -4  -5  -2  -8
    P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1  -1   0  -1  -8
    S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   0   0   0  -8
    T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   0  -1   0  -8
    W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -5  -6  -4  -8
    Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -3  -4  -2  -8
    V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4  -2  -2  -1  -8
    B   0  -1   2   3  -4   1   3   0   1  -2  -3   1  -2  -4  -1   0   0  -5  -3  -2   3   2  -1  -8
    Z   0   0   1   3  -5   3   3   0   2  -2  -3   0  -2  -5   0   0  -1  -6  -4  -2   2   3  -1  -8
    X   0  -1   0  -1  -3  -1  -1  -1  -1  -1  -1  -1  -1  -2  -1   0   0  -4  -2  -1  -1  -1  -1  -8
    *  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8   1
  • 162. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net BLOcks SUbstitution Matrix (BLOSUM) • Steven Henikoff, 1992 • Calculated frequency of substitutions in conserved motifs, rather than across the global alignments.
  • 163. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Scoring gapped alignments • Fixed cost to open a gap • Weighted (affine) cost to increase an existing gap. • Models biological events better than a fixed cost • To score one alignment: – Sum substitution scores and gap costs. • To find the best possible alignment: – Calculate score for all possible alignments – Pick the best one.
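Scoring one already-aligned pair, as described above, is a single pass over the columns. The sketch below is a simplification under stated assumptions: it uses a flat match/mismatch score in place of a real substitution matrix such as PAM 250, the penalty values are arbitrary illustrations, and a gap in one sequence followed directly by a gap in the other is treated as one continuing gap.

```python
def score_alignment(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -5, gap_extend: int = -1) -> int:
    """Score two aligned strings of equal length, '-' marking gaps.

    The first column of a gap costs `gap_open`; each further column of
    the same gap costs `gap_extend` (an affine gap model).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score, in_gap = 0, False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


if __name__ == "__main__":
    # The gapped protein alignment used elsewhere in this deck
    print(score_alignment("DDLMLSPDDLAQWLTEDPGPSEAPRMSE",
                          "DDLLL-PQDVEEFF---EGPSEALRVSG"))
```

Finding the best alignment by scoring "all possible alignments" is not practical beyond toy sizes, which is why dynamic programming is used instead.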
  • 164. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Global Alignments • May miss conserved domains/motifs
  • 165. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Local Alignments • Good for finding short similar regions (eg protein domains, motifs)
  • 166. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Optimal Alignments • Needleman & Wunsch and Smith-Waterman • Exhaustive Search • Alignment you get will have the best possible score • Others may have the same score, but none better • All pairs of sequences have an optimal alignment, whether or not they are meaningful • Slow
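For the global (Needleman and Wunsch) case, the optimal score can be computed without enumerating alignments. This is a minimal sketch with assumptions labeled: a flat match/mismatch score, a linear gap penalty instead of the affine one described earlier, and only the score is returned (no traceback). The name and default values are illustrative.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of `a` against `b`.

    Fills the dynamic programming table row by row, keeping only the
    previous row, so memory use is proportional to len(b).
    """
    prev = [j * gap for j in range(len(b) + 1)]  # align "" against b[:j]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # align a[:i] against ""
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[len(b)]


if __name__ == "__main__":
    print(needleman_wunsch_score("GATTACA", "GATTACA"))
    print(needleman_wunsch_score("AC", "AGC"))
```

Every pair of strings gets a score this way, including unrelated ones, which is the point of the "whether or not they are meaningful" bullet.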
  • 167. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Smith, Waterman (1981) • Finds highest scoring region in common • Uses a “Dynamic Programming” algorithm • Compute time grows with the square of the length of the sequences • Example: Is ELVIS in the SEVENELEVEN?
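The ELVIS question can be answered with a short local alignment routine. The sketch below follows the Smith-Waterman recurrence but makes simplifying assumptions that are not from the slides: a flat match/mismatch score rather than a substitution matrix, a linear gap penalty, and arbitrary parameter values. The nested loops over both sequences are where the quadratic compute time comes from.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -1):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # scores never drop below zero: a local alignment may start anywhere
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # trace back from the best cell until the score falls to zero
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, top, bottom = smith_waterman("ELVIS", "SEVENELEVEN")
    print(score)
    print(top)
    print(bottom)
```

With these parameters the best local match pairs the E, L and V of ELVIS with the ELEV stretch of SEVENELEVEN, skipping one E with a gap.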
  • 168. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Pairwise Alignment Search • Needleman & Wunsch (1970): – Dynamic programming applied to global alignments • Smith & Waterman (1981): – Dynamic programming applied to “Local Dayhoff matrix alignments” • Pearson et al. (1988): FASTA • Altschul et al. (1990): BLAST – Heuristic approximations to Smith & Waterman allowing “reasonable” performance. • Altschul et al. (1997): Gapped BLAST – Further improvements to the BLAST algorithm
  • 169. Suboptimal Alignments • Take shortcuts for the sake of speed • FASTA (Global or Local) – Pearson and Lipman (1988) • BLAST - Basic Local Alignment Search Tool – Altschul, Gish, Miller, Myers and Lipman (1990) – 10-100 times faster than regular Smith-Waterman – Less accurate – Today’s gold standard for searching large databases
  • 170. Why is alignment search complex? • Perfect string matching: • Linear with length of strings • … with gaps: • Polynomial (~1.5 power) with length of strings • … seeking optimal sub-alignments: • Polynomial (~2.5 power) with length of strings • … across an exponentially increasing set of (potentially corrupt) data • A whole new set of problems.
  • 171. The real problem? • The problem is not response time on any single step. • The problems are – Data management – Throughput and updating results – Biological Relevance • We don’t need a faster alignment algorithm, we need a better homology detector.
  • 172. Homology Search (ideal) • Query: • The thing about which you want information. • Target: • Any data at all, preferably all of it at once • Results: • Continually updated as new information is published, plus exhaustive cross references. • Clear distinction between lab verified and automatic annotations • “Clickable” is good.
  • 173. BLAST • Basic Local Alignment Search Tool • Focus on local alignments • Important similarities are often confined to small regions within larger sequences. • BLAST is a heuristic algorithm: • Finds exact matches quickly (linear time) • BLAST is the single most popular homology search program (as of 2004)
  • 174. BLAST Search • BLAST finds sequences that are “similar” to a query.
    Sequences producing significant alignments:                        Score (bits)  E-value
    gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC ...          993       0.0
    gi|3688210|emb|AJ010861.1|BTAJ10861 Bos taurus MHC class I ...          961       0.0
    gi|2864714|dbj|AB008598.1|AB008598 Bos taurus mRNA for MHC ...          882       0.0
    gi|2864712|dbj|AB008597.1|AB008597 Bos taurus mRNA for MHC ...          827       0.0
    gi|3688212|emb|AJ010862.1|BTAJ10862 Bos taurus MHC class I ...          803       0.0
    gi|2864815|dbj|AB008649.1|AB008649 Bos taurus mRNA for MHC ...          783       0.0
    …
    gi|4106072|gb|AF055348.1|AF055348 Diceros bicornis minor cl...          549       e-154
    …
    gi|18699296|gb|AF464053.1| Sus scrofa MHC class I antigen (...          468       e-129
    gi|188474|gb|M84694.1|HUMMHHLAB4 Human MHC class I HLA-B*40...          462       e-127
    …
  • 175. The BLAST Heuristic • To be eligible for consideration, a sequence pair must contain an ungapped Maximal-scoring Segment Pair (MSP) whose score exceeds some threshold. • Two stage process: – Find high-scoring segment pairs (HSPs) (linear time) – Generate alignments, anchored by those HSPs.
    caAACTGCTGaacgttgtcgtgagttctggctgcta--
    --AACTGCTGggctctc-----ccgatcggctggcaaa
    • This throws away the vast majority (99% in a random sample) of sequences in the target set.
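The two-stage idea (find exact word matches cheaply, then extend them without gaps and keep only those that score above a threshold) can be sketched as follows. This is a heavily simplified illustration and not NCBI's implementation: the function names, the word size of 7, the flat match/mismatch scores, the drop-off rule, and the threshold are all assumptions chosen so the slide's own pair of sequences can be used as input.

```python
from collections import defaultdict


def index_words(seq, w):
    """Map every length-w word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def extend_ungapped(query, target, qi, ti, w, match=1, mismatch=-3, xdrop=5):
    """Extend an exact word hit in both directions without gaps.

    Extension stops once the running score falls more than xdrop below
    the best score seen so far. Returns (score, q_start, t_start, length).
    """
    score = best = w * match
    right = k = 0
    while qi + w + k < len(query) and ti + w + k < len(target):
        score += match if query[qi + w + k] == target[ti + w + k] else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score > xdrop:
            break

    score, left, k = best, 0, 0
    while qi - 1 - k >= 0 and ti - 1 - k >= 0:
        score += match if query[qi - 1 - k] == target[ti - 1 - k] else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score > xdrop:
            break
    return best, qi - left, ti - left, w + left + right


def find_hsps(query, target, w=7, threshold=8):
    """Stage 1 of the heuristic: seed on exact words, keep extensions >= threshold."""
    target_index = index_words(target, w)
    hsps = set()  # different seeds often extend to the same segment pair
    for qi in range(len(query) - w + 1):
        for ti in target_index.get(query[qi:qi + w], []):
            hsp = extend_ungapped(query, target, qi, ti, w)
            if hsp[0] >= threshold:
                hsps.add(hsp)
    return sorted(hsps, reverse=True)


if __name__ == "__main__":
    # The two rows from the slide, with gap characters removed.
    query = "caAACTGCTGaacgttgtcgtgagttctggctgcta--".replace("-", "").upper()
    target = "--AACTGCTGggctctc-----ccgatcggctggcaaa".replace("-", "").upper()
    for score, q_start, t_start, length in find_hsps(query, target):
        print(score, query[q_start:q_start + length])
```

A target sequence that shares no word with the query never reaches the extension step at all, which is how most of the database gets discarded cheaply. The gapped alignment of the second stage would then be run only around the surviving segment pairs.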
  • 176. BLAST Search Spaces
    Program   Query type    Database type   Frames searched (query x database)
    blastp    Protein       Protein         1x1
    blastn    Nucleotide    Nucleotide      1x1
    blastx    Nucleotide*   Protein         6x1
    tblastn   Protein       Nucleotide*     1x6
    tblastx   Nucleotide*   Nucleotide*     6x6
    * Translated in all 3 reading frames on both strands
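The "6" in 6x1, 1x6 and 6x6 comes from reading a nucleotide sequence in three frames on each of its two strands. The sketch below only splits a sequence into the codons of each frame; it does not translate them to amino acids, and the function names are hypothetical.

```python
# Complement table for unambiguous DNA bases (ambiguity codes are not handled).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Return six lists of codons: offsets 0, 1, 2 on the forward strand,
    then offsets 0, 1, 2 on the reverse-complement strand."""
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            sub = strand[offset:]
            usable = len(sub) - len(sub) % 3  # drop a trailing partial codon
            frames.append([sub[i:i + 3] for i in range(0, usable, 3)])
    return frames


if __name__ == "__main__":
    for number, codons in enumerate(six_frames("ATGGCCTAA"), start=1):
        print(number, codons)
```

A translated search such as tblastx compares every query frame with every database frame, which is why its search space is 36 times larger than a plain blastn search over the same sequences.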
  • 177. BLAST Scores • “Score” S = S(substitutions) – S(gaps) • “Bit Score” • Score, normalized for λ and K, two parameters which should be left alone anyway, and converted to something looking vaguely information theoretic. Sn = (λS − ln K) / ln 2 • “E-Value” • “Expected number of hits of this score, in a target set of size n, with a query of length m” E = m n 2^(−Sn)
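The relationship between raw score, bit score and E-value can be checked numerically. The sketch below uses the standard Karlin-Altschul forms Sn = (λS − ln K) / ln 2 and E = m n 2^(−Sn); the function names are hypothetical, and the λ and K values in the usage example are illustrative numbers of the kind BLAST reports for a gapped protein search, not values taken from this deck.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S into bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score: m * n * 2^(-bits)."""
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    lam, k = 0.267, 0.041  # assumed, illustrative statistical parameters
    bits = bit_score(100, lam, k)
    for db_len in (1e9, 2e9):
        print(f"database length {db_len:.0e}: bits={bits:.1f}  E={e_value(bits, 300, db_len):.3g}")
```

Because n appears as a plain multiplier, the same alignment gets a proportionally larger E-value every time the database grows, while its bit score stays the same.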
  • 178. BLAST Search • Bits: Large scores are good • E-value: Small values are good
    Sequences producing significant alignments:                        Score (bits)  E-value
    gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC ...          993       0.0
    gi|3688210|emb|AJ010861.1|BTAJ10861 Bos taurus MHC class I ...          961       0.0
    gi|2864714|dbj|AB008598.1|AB008598 Bos taurus mRNA for MHC ...          882       0.0
    gi|2864712|dbj|AB008597.1|AB008597 Bos taurus mRNA for MHC ...          827       0.0
    gi|3688212|emb|AJ010862.1|BTAJ10862 Bos taurus MHC class I ...          803       0.0
    gi|2864815|dbj|AB008649.1|AB008649 Bos taurus mRNA for MHC ...          783       0.0
    …
    gi|4106072|gb|AF055348.1|AF055348 Diceros bicornis minor cl...          549       e-154
    …
    gi|18699296|gb|AF464053.1| Sus scrofa MHC class I antigen (...          468       e-129
    gi|188474|gb|M84694.1|HUMMHHLAB4 Human MHC class I HLA-B*40...          462       e-127
    …
  • 179. E-Value 2.71828182845904523536028747 • Unstable: – Changes every time the dataset grows. • E-Values are not probabilities – Yet people seem to treat them as though they are • Rules of thumb: – 10^-30: A good, solid hit. Take it to the lab and verify it. – 10^-10: Okay. Base some further literature search on this. – 1: Threshold of random chance – 10: BLAST default cutoff
  • 180. “Low Complexity Regions” • By default, BLAST filters out regions of “Low Complexity” and replaces them with “XXXXX” in the alignments. • This may or may not be what you want.
  • 181. Potential Problems • Round off errors • Can fail the ‘diff’ test between 32 and 64 bit architectures • “Silent” errors • Check those logfiles. • Parsing • Please do not write another BLAST output parser. • There are too many of them already in the world. • Seriously. • I’m not kidding about this one. • Shadowing: • Omission of interesting short hits in favor of less interesting but longer hits.
  • 182. BLAST Implementations • NCBI BLAST • NCBI Web Site • NCBI command line tools • Washington University BLAST • (web based & command line) • TIGR online searches • Los Alamos National Lab – MPI-BLAST • TimeLogic Corporation • “Tera-BLAST” • Everyone else in the world…
  • 183. Is There A Parallel BLAST? Yes.
  • 184. Multiple Sequence Alignment • Given a family of related sequences, construct an optimal multiple sequence alignment (MSA). • Based on that MSA, construct models which can be used to recognize as yet unrecognized members of the set.
  • 185. Multiple Sequence Alignments • Patterns • Motifs • Position Specific Scoring Matrices • Hidden Markov Models • Neural Networks
  • 186. Danger Points • No longer computing similarity to any single observed sequence (what would they test in the lab?) • “Transitive Catastrophe” • Statistical Starvation.
  • 187. Beware Intellectual Inbreeding • Using known protein families, we compute costs for amino acid substitutions. • Using those costs, we search for potential homologies and new (putative) families. • Build statistical models based on putative protein families • Rediscover known families with statistical techniques • Does this provide independent confirmation?
  • 188. Example: ClustalW • Align each sequence to each other sequence • Select a seed alignment • Build up a multiple alignment from the pieces • Works great for close relatives
  • 189. Conserved Patterns • Motifs: – Conserved substrings in multiple alignments / sets of sequences • Position Specific Scoring Matrices: – Add “at each position in an alignment” to the work of Dayhoff.
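A position specific scoring matrix keeps a separate set of residue scores for every column of an alignment, rather than one substitution table for the whole sequence. The sketch below builds a log-odds matrix from a small ungapped DNA alignment. The function names, the uniform background frequencies, the pseudocount of 1, and the toy motif in the usage example are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column.

    Assumes an ungapped alignment and a uniform background distribution.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    background = 1.0 / len(alphabet)
    total = len(aligned) + pseudocount * len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        # Pseudocounts keep unseen residues from scoring minus infinity.
        pssm.append({
            residue: math.log2(((counts[residue] + pseudocount) / total) / background)
            for residue in alphabet
        })
    return pssm


def score_window(pssm, window):
    """Sum the per-column scores for a window the same length as the matrix."""
    return sum(column[residue] for column, residue in zip(pssm, window))


if __name__ == "__main__":
    motif = ["TATAAT", "TATAAT", "TACAAT", "TATACT"]  # toy alignment
    pssm = build_pssm(motif)
    for candidate in ("TATAAT", "TACACT", "GCGCGC"):
        print(candidate, round(score_window(pssm, candidate), 2))
```

The fewer sequences the matrix is built from, the more the pseudocount dominates the scores, which is one concrete form of the "statistical starvation" problem raised on the Danger Points slide.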
  • 190. What is HMMer? • Written by Sean Eddy at Wash U • Open Source • 15 separate executables • Build a statistical model of a multiple sequence alignment • Search sequence databases with models • Search model databases with sequences
  • 191. Search Shrimp for globin • Build an HMM from 50 globins % hmmbuild globin.hmm globins50.msf • Calibrate the model % hmmcalibrate globin.hmm • Search the shrimp sequence database with the model % hmmsearch globin.hmm Artemia.fa • Search the model database with the shrimp sequences % hmmpfam globin.hmm Artemia.fa
  • 192. MSF Format…
    DNA_MULTIPLE_ALIGNMENT 1.0
    Three anthropoidea
    MSF: 50 Type: N Check: 2666 ..
    Name: Homo_sapiens    Len: 50 Check: 8318 Weight: 1.00
    Name: Pan_paniscus    Len: 50 Check: 7854 Weight: 1.00
    Name: Gorilla_gorilla Len: 50 Check: 7778 Weight: 1.00
    //
    Homo_sapiens    AGUCGAGUC...GCAGAAAC
    Pan_paniscus    AGUCGCGUCG..GCAGAAAC
    Gorilla_gorilla AGUCGCGUCG..GCAGAUAC

    Homo_sapiens    GCAUGAC.GACCACAUUUU.
    Pan_paniscus    GCAUGACGGACCACAUCAU.
    Gorilla_gorilla GCAUCACGGAC.ACAUCAUC

    Homo_sapiens    CCUUGCAAAG
    Pan_paniscus    CCUUGCAAAG
    Gorilla_gorilla CCUCGCAGAG
  • 193. hmm State Diagram
  • 194. hmm Format
    HMMER2.0 [2.2g]
    NAME  globins50
    LENG  148
    ALPH  Amino
    RF    no
    CS    no
    MAP   yes
    COM   ../binaries/hmmbuild globin.hmm globins50.msf
    COM   ../binaries/hmmcalibrate globin.hmm
    NSEQ  50
    DATE  Thu Jul 25 10:51:38 2002
    CKSUM 9858
    XT    -8455 -4 -1000 -1000 -8455 -4 -8455 -4
    NULT  -4 -8455
    NULE  595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644
    EVD   -41.853970 0.212647
    HMM   A C D E F G H I K L M N P Q R S T V W Y
          m->m m->i m->d i->m i->i d->m d->d b->m m->e
          -661 * -1444
        1 77 -228 -1302 -1020 -730 -1034 -756 578 -803 -375 82 -791 -1461 -720 -959 364 -94 2204 -1315 -857 9
        - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
        - -39 -5807 -6849 -894 -1115 -701 -1378 -661 *
  • 195. What else could be bioinformatics? • Fold / Structure / Docking / Function predictions on proteins and bioactive molecules • Ontology building / literature searches / text mining / knowledge management • Image processing to support lab automation / data capture / experiment steering • Medical records integration with proteomic / transcript studies • Expert systems / AI / Clinical / Lab assistant • Virtual organizations, distributed databases, ad hoc expert conversations…
  • 196. One would expect wet-lab scientists to have a healthy skepticism of any results, knowing how often experiments fail, and how much bad data has made it out into the literature, but many seem to have an almost mystical faith in anything produced by computation. On the other hand, computational people seem to have an almost mystical faith in wet-lab verification---expecting experiments to be neat, quick deterministic tests like "if" statements in code. - Gordon D. Pusch
  • 197. What can I do today? • CS: – Take biology coursework – Accept that biology is really, really complex and difficult. • Bio: – Take CS coursework – Accept that computer engineering / software development is tricky. • Administrators: – Decide to build a “spire, which will be visible from afar” • All: – Attend Journal Clubs, symposia, etc. – Get a bigger monitor
  • 198. Thank you