Intro bioinformatics

© 2004: The BioTeam
http://bioteam.net
cdwan@bioteam.net
Genomic Biology and
Bioinformatics
The BioTeam

BioTeam™ Inc.
• Objective & vendor neutral informatics and ‘bio-IT’ consulting
• Composed of scientists who learned to bridge the gap between life
science informatics and high performance IT
• “iNquiry” bioinformatics cluster solution
• Staff
Michael Athanas Bill Van Etten
Chris Dagdigian Stan Gloss
Chris Dwan
http://bioteam.net

Goal of this session
• Introduce major concepts in genetics, genomics, and
bioinformatics.
• Provide a minimal vocabulary to enable communication.
• Enable communication between the disciplines
Please ask questions

Outline
• Genetics to Genomics
• Data formats & Resources
• Sequence Analysis

Goals
• Build shared vocabulary, global view
• Introduce online and text resources
• Build interest
Not:
• Teaching molecular biology
• Teaching bioinformatics

Motivation for this session
• Bioinformatics will be the major new application
domain for High Performance Computing (HPC)
applications over the next 50 years.
• Life Scientists will walk into the computing
center, wanting to work with you (or you will
walk into their lab…)
• No need to repeat old mistakes.

What is Bioinformatics?
• http://bioinformatics.org/faq/#definitions
– Computational Biology
– Systems Biology
– Genetics
– Biology
– *-omics
• The application of high performance computing and data
handling techniques to life sciences research
• A major revenue stream, with lots of hype

Genome Sizes (in base pairs)
• HIV type 1 (HIV): 9,750
• Escherichia coli (E. coli): 4 x 10^6
• Saccharomyces cerevisiae (yeast): 10^7
• Oryza sativa (rice): 10^8
• Arabidopsis thaliana (“mouse-ear cress”): 10^8
• Drosophila melanogaster (fruit fly): 1.8 x 10^8
• Bos taurus (cow): 3 x 10^9
• Homo sapiens (human): 3 x 10^9
• Zea mays (corn): 5 x 10^9
• Pinus resinosa (pine): 7 x 10^10
• Amoeba dubia (amoeba): 6.7 x 10^11

In context (Jan, 2004)
• Complete genomes: ~800
– 19 eukaryotic
– 16 archaeal
– 64 bacterial
– The rest: viruses
• Eukaryotes with at least one sequence in GenBank:
– Between 50,000 and 100,000
• Distinct species:
– 1.4 x 10^6 uniquely named species
– ~10^7 distinct species

What else could be
bioinformatics?
• Fold / Structure / Docking / Function predictions on proteins
and bioactive molecules
• Ontology building / literature searches / text mining /
knowledge management
• Image processing to support lab automation / data capture /
experiment steering
• Medical records integration with proteomic / transcript
studies
• Expert systems / AI / Clinical / Lab assistant
• Virtual organizations, distributed databases, ad hoc expert
conversations…

Suffixes
• “ology”:
• Biology, Physiology, Embryology, Terminology
• Homology? Homo = same; logy = origin
• “ics”:
• Physics, Linguistics, Statistics, Bioinformatics
• “ome”:
• Proteome, Genome, Transcriptome
• Chromosome? Chromo = color; soma = body;
• “ome-ics”:
• Proteomics, genomics
• Economics?

Topics in Genomics
• The Central Dogma
• Levels of structure and interaction
• The Chromosome Model
• DNA Sequencing
• Genome Assembly
• Transcripts and Expression Levels
• Protein Folding
• Protein Interaction

What I want you to remember
• Genotype vs. Phenotype
• Levels of Structure (primary -> quaternary)
• Homology is boolean
• It’s more complicated than they will admit (at first)
• http://www.bioinformatics.org
• http://www.ncbi.nih.gov
Bioinformatics is Biology

Real Question (July 15, 2002)
“We have 10,000 BAC end reads from an organism
with massive synteny to a model organism. We
want to map markers from the model onto the
putative homologs in the BAC clones so that we can
do directed sequencing.”

Example Question
We have 10,000 BAC end reads from an organism with massive synteny to
a model organism. We want to map markers in the model onto the
putative homologs in the BAC clones so that we can do directed
sequencing.
• What is a BAC end read? How does it differ from a BAC clone?
• What is a Homolog? Given that, what is a “putative” one?
• What is “Synteny?” Is it different from homology?
• What is a model organism?
• What are “markers?”
How can I best help this person?

Real Question (May 30, 2002)
“Tell me all the kinases which have a valine or an
arginine within 2 angstroms of the active site.”
• What is a kinase?
• What are valine and arginine?
• What is an active site?

Why Put The Biology First?
“Bioinformatics is full of pitfalls for those who look for
patterns or make predictions without a thorough
understanding of where biological data comes from
and what it means”
Nevin Young PhD
Professor, University of Minnesota

A New Way of Thinking
• "The new paradigm, now emerging, is that all genes will
be known (in the sense of being resident in databases
available electronically), and that the starting point of a
biological investigation will be theoretical.”
- Walter Gilbert, 1993
speculating on the nature of biology in the "post-genome era"

Genetics to Genomics
• 1600’s: Europe emerges from the dark ages
• 1822 - 1884: Gregor Mendel
• 1920’s: Genetic Mapping (Morgan)
• 1952: DNA is Genetic Material (Hershey)
• 1953: DNA Helix (Watson & Crick, Franklin)
• 1966: Genetic Code (Nirenberg, Khorana)
• 1977: DNA Sequenced (Sanger)
• 1988: Human Genome Project Started
• 2001: Human Genome Draft Finished

Selective Breeding

Francesco Redi: 1626-1697
• Prevailing Theory “Spontaneous Generation”
– Meat makes maggots
– Straw makes mice
• Experiment:
– Meat in two jars, one open one sealed.
– Observe flies -> eggs -> maggots -> flies
– nothing happens to the closed jar meat
• Inference: Flies make flies.
• Confirmed by Pasteur in mid 1800’s

Science Marches On!
• 1651 - William Harvey
• Theory: “Ex Ovo Omnia” From the egg, everything! (No evidence whatsoever)
• 1827 - Karl Ernst von Baer
• First mammalian egg observed under a microscope. (dog)
• 1868 - Friedrich Miescher
• DNA (“Nuclein”) first observed. (Surgical bandages from soldiers)
• 1875 - Oscar Hertwig
• Observed that fertilization in both animals and plants consists of the physical union
of the two nuclei contributed by the male and female parents. (Sea Urchin)
• 1882 - Walther Flemming
• Observed chromosomes by staining dividing cells at mitosis (Salamander)

Gregor Mendel (1822-1884)
• Monk, Interested in math &
gardening
• Selectively bred pea plants
– 28,000 plants over 7 years
– 7 distinct phenotypic traits.
• Published: 1866
• First Cited: 1900

Why did Mendel succeed?
• Studied one characteristic at a time:
– Pea shape
– Internal color
– Seed-coat and flower color
– pod shape
– pod color
– flower position
– plant height
• Kept pedigrees and made several generations of crosses
• Kept track of numbers of progeny from each cross.
Mendel was really, really lucky.

Genotype vs. Phenotype
• Genotype:
– Properties (not necessarily observable) that can be passed on to
offspring
– DNA code and other genetic properties
• Phenotype:
– Observable traits of the organism
– Things we can see
Farmers have known this for a long time

Mendelian Genetics
• Genetic “factors” (genes) determine
phenotypic traits.
• Each organism has two instances
(alleles) of each gene.
• Segregation: one allele from each
parent (selected at random) is passed
on to each progeny.
• Independent assortment: alleles of
different genes are passed on
independently of one another.
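
A single-gene cross can be enumerated directly. The sketch below is a minimal illustration, assuming one gene with two alleles written as the hypothetical letters "A" and "a"; the function name `cross` is ours, not part of any library.

```python
from collections import Counter
from itertools import product


def cross(parent_a: str, parent_b: str) -> Counter:
    """Count offspring genotypes for one gene, given each parent's two alleles."""
    offspring = Counter()
    # Each parent passes on one of its two alleles, chosen with equal probability,
    # so every (allele, allele) combination is equally likely.
    for allele_a, allele_b in product(parent_a, parent_b):
        # Sort so that "aA" and "Aa" are counted as the same genotype.
        genotype = "".join(sorted(allele_a + allele_b))
        offspring[genotype] += 1
    return offspring


if __name__ == "__main__":
    # Two heterozygous parents give the classic 1 : 2 : 1 genotype ratio.
    print(cross("Aa", "Aa"))
```

With a dominant "A", the 1 : 2 : 1 genotype ratio shows up as a 3 : 1 ratio of phenotypes.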

Cell Division
• Mitosis:
• “Ordinary” cell division
• Start with 1 diploid cell
• End with 2 diploid cells
• No crossing over (or, if so, it
doesn’t matter)
• Meiosis:
• “Gametogenesis”
• Start with 1 diploid cell
• End with 4 haploid gamete
cells
• Crossing over occurs
(mechanism for recombination
between linked genes)

How was Mendel Lucky?
Mendel was lucky because:
• Peas are diploid
• The traits he studied were all far apart on the chromosomes
• He didn’t start with an apomictic (or otherwise freakish) plant; peas self fertilize but can also be crossed by hand
Mendel was unlucky because:
• Despite being mostly correct, his paper was rejected by his
journal of choice
• He died before anyone discovered and cited his results
• People now think that he must have cleaned his data.

“The Chromosome Model”
With this model, we can
look at the entire range of
molecular biology, from
chromosomes to base
pairs.
This is not a mechanism

Chromosomes
• Chromo = color
• Soma = body
• Chromosomes:
– Colored (when stained) bodies that
appear in the cell at mitosis and
meiosis
– Appear in pairs, except in gamete
cells (sperm and ova), where they
are single.
– A good candidate for the location of
genes

Science Marches On!
• 1902: Walter Sutton
– Evidence that Mendel’s genetic factors exist on chromosomes
(grasshoppers)
Metaphase Spread Karyotype

Number of (different)
Chromosomes
Chimpanzee 48
Cabbage 18
Camel 74
Chicken 78
Cat 38
Dog 78
Human 46
Corn 20
Alligator 32

Chromosome Copies: “Ploidy”
“Number of copies of each chromosome”
• 2 = Diploid:
– Humans (and the majority of other eukaryotes)
• 4 = Tetraploid:
– Durum wheat
• 6 = Hexaploid:
– Bread wheat
• 8 = Octoploid:
– Cultivated strawberry

Thomas Morgan (1866-1945)
• “The Fly Room”
– Breeding experiments on Drosophila
melanogaster (Columbia University)
• Alfred Sturtevant:
– First Chromosome Map
• Calvin Bridges:
– Chromosome theory of Heredity
• Hermann Muller:
– Mutations can be induced by X-ray
irradiation

Why Model Organisms?
• Fruitflies:
• Only eight chromosomes.
• Reproduce very quickly, with lots of offspring.
• Tiny, so they don't take up a lot of room in the lab.
• They don't need a whole lot of food to survive.
• More Recently:
• Small genome
• Easily transformed
• Numerous mutants
• Well funded research community

Some modern models
• Drosophila melanogaster
• Mus musculus
• Anopheles gambiae
• Arabidopsis thaliana
• Medicago truncatula
• Oryza sativa
• Glycine max
• Zea mays

Prokaryotes vs. Eukaryotes
• Viruses: (10^2 genes, 10^4 base pairs)
• Prokaryote: (10^3 genes, 10^6 base pairs)
• No Nucleus (Mostly bacteria)
• No Introns (genes read continuously)
• One circular chromosome
• Genes clumped together in “operons”
• Much simpler genetics. Also much harder to see.
• Eukaryote: (10^4 genes, 10^9 base pairs)
• Nucleated
• Introns (Genes have untranslated “stuff” stuck in them)
• Many, linear chromosomes
• Genes spread out all over the place
• Multi-cellular and therefore more interesting.

Chromosome Mapping
y^3 – 12y^2 + 2y + 4 = 0
Alfred Sturtevant was an
undergraduate working in Morgan’s
lab who (the story goes) set aside his
algebra homework one night to
create the first genetic map.

Crossing Over

Chromosome Mapping
• Linked Genes:
– Recombine less frequently than expected by Mendel’s law of
independent assortment
– Frequency of recombination ∝ distance
– Sturtevant called the unit of distance “map units”
– Frequently referred to as “centiMorgans” after Dr. Morgan
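
The mapping idea can be sketched in a few lines. This is a minimal illustration, assuming that one map unit corresponds to 1% recombinant offspring (a reasonable approximation only for short distances); the gene names and distances in the example are hypothetical, and `map_distance` and `order_three` are our own helper names.

```python
def map_distance(recombinant: int, total: int) -> float:
    """Recombination frequency expressed in map units (centiMorgans)."""
    if total <= 0:
        raise ValueError("total offspring count must be positive")
    return 100.0 * recombinant / total


def order_three(distances: dict[tuple[str, str], float]) -> list[str]:
    """Order three linked genes from their pairwise map distances.

    The two genes that are farthest apart sit at the ends;
    the remaining gene lies between them.
    """
    (left, right), _ = max(distances.items(), key=lambda item: item[1])
    genes = {gene for pair in distances for gene in pair}
    (middle,) = genes - {left, right}
    return [left, middle, right]


if __name__ == "__main__":
    print(map_distance(17, 100))  # 17 recombinants out of 100 offspring
    print(order_three({("a", "b"): 1.0, ("b", "c"): 32.0, ("a", "c"): 33.0}))
```

The second function is the essence of building a linear map: pairwise distances that (roughly) add up imply a single ordering along the chromosome.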

A Genetic Map of Drosophila
Note that we’re
still not looking at
DNA sequences.

DNA is the Genetic Material
1943: Oswald Avery et al.
sacrifice mice to demonstrate
that DNA could be the material
for genes (to one part in
6 x 10^8).
1952: Alfred Hershey and
Martha Chase use viruses to
prove it.
“Perhaps we will be able to grind
genes in a mortar and cook
them in a beaker after all.”
-Hermann Muller

“At the time it was believed that DNA was a stupid
substance. A tetranucleotide which couldn’t do
anything specific.”
-Max Delbruck

Nobel Milestones
• 1953 - 3D Structure of DNA
– Watson & Crick - model
– Wilkins & Franklin -x-ray structure
– Nobel in 1962

1953: Watson & Crick Structure
• Nucleotides
– ‘A’ Adenine
– ‘G’ Guanine
– ‘C’ Cytosine
– ‘T’ Thymine
“It has not escaped our notice that
the specific pairing we have
postulated immediately suggests a
possible copying mechanism for the
genetic material.”
Watson & Crick, 1953

Deoxyribonucleic Acid
Chromosomes are long chains
of nucleotides in complementary
strands…
...AAACTGGAGCTCACCGCGGTGGCGGC...
...TTTGACCTCGAGTGGCGCCACCGCCG...
Complementary single strands
have strong affinity for each
other:
A pairs with T, G pairs with C.
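
Under Watson-Crick pairing (A with T, G with C), one strand fully determines the other. The sketch below is a minimal illustration of that rule; the function names are ours.

```python
# Translation table implementing Watson-Crick pairing: A<->T, G<->C.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq: str) -> str:
    """Return the base-by-base complement, in the same left-to-right order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the opposite strand read in its own 5' to 3' direction."""
    return complement(seq)[::-1]


if __name__ == "__main__":
    print(complement("AAACTGGAGCTCACCGCGGTGGCGGC"))
    print(reverse_complement("GATC"))
```

Most sequence tools report the reverse complement, because the two strands run in opposite directions.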

Nobel Milestones
• 1959 – 3D Structure of a Protein
– Perutz & Kendrew
– structure of myoglobin & hemoglobin
– Nobel in 1962

Nobel Milestones
• 1970’s – Nucleic Acid Chemistry
– Paul Berg – recombinant DNA
– Gilbert & Sanger – sequencing
– Nobel in 1980

Sequencing
• First protein sequence (insulin)
published by Sanger, 1955; his DNA
sequencing method followed in 1977
• Generate all possible
subsequences from a fixed
5’ end (primer)
• Sort them by weight (length)
• Read terminal nucleotide

Sanger Sequencing
…AGTCCTG
…AGTCCT
…AGTCC
…AGTC
…AGT
…AG
…A
G A T C
•DNA of all possible lengths from a
known starting point
•Each strand ends with a radioactive
“dideoxy” nucleotide which terminates
the chain
•The strands are “weighed” using gel
electrophoresis
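
The read-out logic can be mimicked in software. This is a toy sketch only: it assumes error-free chemistry and ignores the fact that the synthesized strands are complementary to the template. The function name `sanger_read` and the `seed` parameter are ours.

```python
import random


def sanger_read(template: str, seed: int = 0) -> str:
    """Recover a sequence from its chain-terminated fragments."""
    # Chain termination yields every possible prefix from the fixed starting point.
    fragments = [template[:n] for n in range(1, len(template) + 1)]
    # In the tube the fragments are an unordered mixture.
    random.Random(seed).shuffle(fragments)
    # Gel electrophoresis "weighs" them: sorting by length restores the order.
    fragments.sort(key=len)
    # The labeled terminal base of each successive fragment spells the sequence.
    return "".join(fragment[-1] for fragment in fragments)


if __name__ == "__main__":
    print(sanger_read("AGTCCTG"))
```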

Modern Sequencing
• Accomplished in a single capillary tube
• Results read via a laser spectrometer
• Accurate to ~700bp
• Completely automated (~$0.04 / bp in 2003)

Sequence Data, Errors
• Error rates for a single read = 0.002
• One error per read sequence, on average
• Types of error:
• Rare - Misreads
• Common - Deletions / double-reads
• Insertion of sequence from the vector
• Contamination with human or E. Coli DNA
• Quality tapers off at the end of a read

Nucleotide Ambiguity Codes
A = Adenine      G = Guanine      T = Thymine      C = Cytosine
R = A or G       Y = C or T       K = G or T
M = A or C       S = C or G       W = A or T
V = A, C or G    B = C, G or T
H = A, C or T    D = A, G or T
N = A, G, T or C
I = hypoxanthine
!(i/[GATCsn]+/)
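
Ambiguity codes map naturally onto regular-expression character classes. The sketch below is one possible way to search with them, assuming the code table above; the names `IUPAC` and `to_regex` are ours.

```python
import re

# Each ambiguity code stands for a set of possible bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "V": "ACG", "B": "CGT", "H": "ACT", "D": "AGT", "N": "ACGT",
}


def to_regex(pattern: str) -> re.Pattern:
    """Compile a pattern written with ambiguity codes into a regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]  # raises KeyError for characters outside the table
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))


if __name__ == "__main__":
    motif = to_regex("GAYC")
    print(motif.pattern)
    print(bool(motif.search("TTGATCAA")))
```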

Restriction Enzymes
• Cut DNA at a specific substring (different for each restriction enzyme)
…GGCTAGATTCCCTAGTTCGCTAATCGCT…
 ||||||||||||||||||||||||||||
…CCGATCTAAGGGATCAAGCGATTAGCGA…
Cut with “CTAGT” Restriction Enzyme
…GGCTAGATTCCCTAG       TTCGCTAATCGCT…
 |||||||||||           |||||||||||||
…CCGATCTAAGG       GATCAAGCGATTAGCGA…
Sticky Ends
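
A digest of one strand is just a substring search. The sketch below is a simplified illustration that looks at the top strand only; the recognition site "CTAGT" is the slide's made-up enzyme, and the `cut_offset` parameter (how far into the site the cut falls) is our assumption.

```python
def digest(sequence: str, site: str, cut_offset: int) -> list[str]:
    """Cut `sequence` at every occurrence of `site`.

    The cut falls `cut_offset` bases after the start of each recognition site.
    """
    fragments = []
    start = 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        # Keep scanning for further sites downstream.
        position = sequence.find(site, position + 1)
    fragments.append(sequence[start:])
    return fragments


if __name__ == "__main__":
    print(digest("GGCTAGATTCCCTAGTTCGCTAATCGCT", "CTAGT", 4))
```

The opposite strand is cut at a different offset within the same site, which is what leaves the single-stranded overhangs ("sticky ends").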

Restriction Enzymes
• “Cut” DNA only at a substring specific to the
restriction enzyme.
• Statistically, these substrings will occur several
times along the length of a chromosome:
Chromosome
Cut Sites

Vectors
• Circular pieces of DNA with a cut site
• Used to capture pieces of DNA
Insertion site
…GGCTAGATTCCCTAG       TTCGCTAATCGCT…
 |||||||||||           |||||||||||||
…CCGATCTAAGG       GATCAAGCGATTAGCGA…
Sticky Ends

Modern vectors
Many possible
Insertion sites
Gene coding for a brightly
colored protein so we can
visually distinguish vectors with
inserts from those without
Gene conveying
resistance to ampicillin

Making Insert Libraries
• Separate out DNA from target organism
• Use PCR to make lots of copies of the DNA
• Cut with restriction enzymes, with vectors present in solution
• Place vectors into E. coli cells
• Spread vectorized E. coli onto agar plates
• Let grow overnight on medium with ampicillin
• Transfer only non-blue colonies into multi well plates (96 or 384).
• Sequence all the wells.
• What do you get after all this fun?
Thousands of “clone libraries” in a freezer somewhere

Sizes of Insert Libraries
• Phage Library:
• 5 - 3,000bp
• Bacterial Artificial Chromosome (BAC):
• 80,000 - 100,000 bp
• Yeast Artificial Chromosome (YAC):
• 150,000 - 200,000 bp

Restriction Enzymes
Restriction Fragments
• By controlling the relative amounts of DNA and
restriction enzyme, we can produce a large set of
smaller chromosome fragments

BAC End Sequences
Restriction Fragments
• It is “easy” to read the 700bp at each end of the
insert libraries

How Many Fragments?
• For a 5 letter (5-mer) restriction enzyme, odds of randomly
hitting the target sequence are approximately:
(1/4)^5 = 1/1024 ≈ 10^-3
• If a genome of interest is about 3 x 10^9 bp this gives us
approximately:
3 x 10^6 segments
• Using 3 or 4 unrealistic assumptions….
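
The estimate is easy to reproduce. The sketch below assumes, unrealistically, that all four bases are equally likely and independent at every position; the function name is ours.

```python
def expected_fragments(genome_bp: float, site_length: int) -> float:
    """Expected number of cut sites (and so, roughly, fragments) in a random genome."""
    # Probability that any given position starts an exact match to the site.
    p_site = 0.25 ** site_length
    return genome_bp * p_site


if __name__ == "__main__":
    n = expected_fragments(3e9, 5)
    print(f"{n:.3g} fragments, averaging {3e9 / n:.0f} bp each")
```

Real genomes have skewed base composition and repeats, so actual fragment counts can differ substantially from this figure.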

Genome Sequencing: BAC Tiling
• Directed BAC Sequencing
– Read all BAC Ends & Fingerprints
– Create the minimal tiling path to cover each chromosome
– Sequence each BAC using smaller insert libraries (but the
same basic idea)
– Close Gaps (primer walking)

Directed BAC Sequencing
Minimum Tiling Path

Shotgun Sequencing
• Use inserts of approximately 1,000bp
• No pre-processing or ordering, use computational
techniques to assemble larger and larger fragments
• Entirely automated
• Works a lot better if someone else is doing BAC
sequencing in the public domain

Finishing a Genome
• Sequence ought to be derived from a mixture of anonymous
individuals
• Hard to finish regions:
– Telomere
– Centromere
– Highly variable regions
• 10x coverage, 99% assembly
• Standards vary by community.
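
"Coverage" is the average number of reads over each base. The sketch below uses the standard Poisson (Lander-Waterman style) approximation, which is not on the slide and is brought in here as an assumption: reads land uniformly at random, and the read counts and lengths in the example are hypothetical.

```python
import math


def coverage(n_reads: int, read_length: int, genome_bp: int) -> float:
    """Average number of reads covering each base."""
    return n_reads * read_length / genome_bp


def fraction_uncovered(cov: float) -> float:
    """Expected fraction of bases hit by no read, under a Poisson model."""
    return math.exp(-cov)


if __name__ == "__main__":
    cov = coverage(n_reads=1000, read_length=700, genome_bp=70_000)
    print(f"{cov:.0f}x coverage leaves about {fraction_uncovered(cov):.1e} of bases unread")
```

The idealized model predicts far less than 1% missing at 10x; in practice repeats, cloning bias, and hard regions such as telomeres and centromeres account for much of the remaining gap.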

We have a genome, now what?
• Where are the genes?
• How are genes controlled / activated?
• Can we add to / subtract from the genome?
• Why is there all that extra “junk” in there?
• What genes are common between organisms?

Topics in Genomics
• Levels of structure and interaction
• DNA Sequencing
• Genome Assembly
• Transcripts and Gene Expression
• Protein Folding
• Protein Interaction

Central Dogma
DNA --(Transcription)--> RNA --(Translation)--> Amino Acid Chains
DNA
•Four bases:
•G, A, T, C
•Double Stranded
•A pairs with T
•G pairs with C
•Packaged in
Chromosomes
RNA
•T->U
•Single Stranded
•Mechanism for
differential gene
expression
Amino Acid
Chains
•20 amino acids
•“Genetic Code”
translates 3 RNA bases
(one codon) to 1
amino acid
All disciplines should have the guts to admit
to having a “central dogma”

Levels of Structure
• Primary: Sequence
• Secondary: Local properties
– Hydrophobic / hydrophilic regions.
– α-helices and β-sheets
• Tertiary: 3-d structure
• Quaternary: Interaction
– Protein-protein interactions
– Post-translational modification
– Enzymatic action
• $$$$

What is a “gene?”
• “The fundamental unit of genetic inheritance”
• “One gene, one transcript”
• One gene, one splice variant
• “One gene, one protein”
• “One gene, one heritable trait”

Nobel Milestones
• 1960’s – Genetic Code
• Holley, Khorana and Nirenberg
• Rosetta Stone of Life
• Nobel in 1968

The Genetic Code

Gamow and the Genetic Code

Transcription & Translation
DNA: …GATC…
     …CTAG…
Transcription
mRNA: …GAUC…
Translation (in one of six possible “Reading Frames”)
Amino Acid Chain: …RIDVLKGEKALKASGLVP…
Folding
Protein: Anthrax Toxin Delivery Factor
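
Transcription and translation reduce to string operations once the genetic code is written as a lookup table. The sketch below assumes the standard genetic code and works on the DNA alphabet throughout; the function names and the use of "X" for untranslatable codons are our choices.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def transcribe(dna: str) -> str:
    """Write the coding strand as mRNA: T becomes U."""
    return dna.upper().replace("T", "U")


def translate(dna: str) -> str:
    """Translate successive codons; unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna: str) -> list[str]:
    """Translate three offsets on each strand."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (forward, reverse) for offset in range(3)]


if __name__ == "__main__":
    print(transcribe("GATC"))
    for frame in six_frames("ATGGCCTAA"):
        print(frame)
```

Without knowing where a gene starts or which strand it is on, all six translations are candidates, which is why search tools often try every frame.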

Eukaryotic genes contain Introns

But wait, there’s more
DNA:
• “Promoter”: TATA
• “Start”: ATG
• “Stop”: TAA
• Open Reading Frame (ORF): from start to stop
• Six reading frames are possible
RNA:
• Splicing: Introns (non coding regions) are removed
mRNA:
• Poly-A tail is attached: AAA… (100+ A’s)
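
A naive ORF scan looks for a start codon followed, in the same frame, by a stop codon. The sketch below is a deliberately simple illustration: it searches one strand only, ignores introns and promoters, and treats TAA, TAG and TGA as stops; `find_orfs` is our own name.

```python
import re

# Lookahead so that overlapping candidates are all reported.
# The lazy codon repeat stops at the first in-frame stop codon.
ORF_PATTERN = re.compile(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))")


def find_orfs(dna: str) -> list[tuple[int, str]]:
    """Return (start index, sequence) for every ATG with an in-frame stop codon."""
    return [(match.start(), match.group(1)) for match in ORF_PATTERN.finditer(dna.upper())]


if __name__ == "__main__":
    for start, orf in find_orfs("CCATGAAATAGGG"):
        print(start, orf)
```

To cover all six reading frames, run the same scan on the reverse complement as well. An ATG inside a longer ORF is reported as its own, shorter candidate.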

Expression Level
• “What protein is being made / which gene is being
turned on when <your question here>?”
• Can approximate this with mRNA levels.
– Translation does not occur at a fixed rate
– Proteins degrade at radically different rates
– Some mRNA is never translated

Expressed Sequence Tags
1. Select organism to study
2. Chop up organism into “libraries”
representing interesting tissues,
developmental stages, or experimental
conditions.
3. Extract and sequence as many cDNAs
as possible from each library.
4. Compare sequences to determine:
• Tissue specific gene expression
• Hypothetical functions for proteins
• Expression levels (relative concentration in
cytoplasm)

Expressed Sequence Tags
Cell
1. Use Enzymes to digest DNA
& Proteins, leaving mRNA with its
poly-A tail: AAAAA… (100+ A’s)
2. Use Reverse Transcriptase
(poly-T primer, TTTTT…) to create
cDNA
3. Capture the cDNA strand in
vector and incorporate into E.
coli cells to replicate.
4. Sequence (via a complex
procedure omitted here for the
sake of brevity) the cDNA.

EST Data
Reads of the same cDNA (product
of the same gene) produce an
assortment of sequences sharing
the poly-A 3’ end, and extending a
random distance toward the 5’
end.
Issues:
• Sequence contamination with E. coli, or vector
• Spurious groupings of cDNA from different genes containing
similar regions
• Omission of genes due to low concentration or lack of
expression (solve with additional libraries)

ESTs are Popular
• Human: 4 x 10^9 sequences
• Mouse: 2 x 10^9 sequences
• Medicago truncatula: 1.6 x 10^6
• Read only the genes which are being expressed
• Get crude information about expression levels based on
frequency of a certain sequence.
• If a genome sequence is available, can locate genes on
chromosomes using similarity search

Southern Blot
• Affix “target” single stranded sequence to a nylon membrane
• Label “probe” single stranded sequences (mRNA from cells) with a
fluorescent dye
• Wash probe over target
• Similar sequences will hybridize (stick together)
• Check for fluorescence
Target
Probe
Fluorescent
Label

Micro / Macroarrays
• Stick (hybridize) single stranded DNA
to some surface (glass slide or nylon
membrane)
• Attach fluorescent markers to the
single stranded “probe” control
sample
• Attach a different frequency of
fluorescent marker to experimentally
stressed probe sequences
• Wash probes over targets. (like will
stick to like)
• Illuminate with laser and record
differential frequency response

Microarray Data

Gene Chips (2003)
• 20bp sequences built using
photolithography
• Sequence must be known in
advance
• $200-$500 per “chip” from
Affymetrix (and others)
• Tools for data analysis also
available for $$

Microarrays vs Gene Chips
• Microarrays
• Cheap to create
• No need to know sequences ahead of time (just use sample that
is already in the freezer)
• Gene Chips
• Initially expensive to create
• All target sequences already known
• “The mouse chip.” “The human chip”

Time Course Experiments
• At t=0, 5, 10, … from start of condition x
• What genes are up and down regulated
• What gene clusters seem to move together?

Quality of Microarray data
• Spot location
• Spot size
• Differential Hybridization
• Errors in “swishing” of the probes
• In general, only differences of 1σ (one standard deviation)
and above are significant.
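
A two-channel comparison is usually summarized as a log ratio per spot. The sketch below is a rough illustration with made-up intensities; it reads the slide's threshold as one standard deviation from the mean log ratio, which is our assumption, and the function names are ours.

```python
import math
import statistics


def log_ratios(control: dict[str, float], treated: dict[str, float]) -> dict[str, float]:
    """log2(treated / control) for each spot; +1 means a two-fold increase."""
    return {gene: math.log2(treated[gene] / control[gene]) for gene in control}


def significant(ratios: dict[str, float], n_sigma: float = 1.0) -> list[str]:
    """Spots whose log ratio lies at least n_sigma standard deviations from the mean."""
    values = list(ratios.values())
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sorted(gene for gene, value in ratios.items() if abs(value - mean) >= n_sigma * sigma)


if __name__ == "__main__":
    control = {"a": 100.0, "b": 100.0, "c": 100.0, "d": 100.0}
    treated = {"a": 100.0, "b": 100.0, "c": 400.0, "d": 100.0}
    print(significant(log_ratios(control, treated)))
```

Real analyses first correct for background, dye bias, and spot quality; this sketch skips all of that.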

Aspects of Protein Structure
1 XMNFSGKYQV QSQENFEPFM KAMGLPEDLI QKGKDIKGVS EIVHEGKKVK
51 LTITYGSKVI HNEFTLGEEX ELETMTGEKV KAVVKMEGDN KMVTTFKGIK
101 SVTEFNGDTI TNTMTLGDIV YKRVSKRI

Amino Acid Codes
Alanine        Ala  A     Arginine       Arg  R
Asparagine     Asn  N     Aspartic Acid  Asp  D
Cysteine       Cys  C     Glutamic Acid  Glu  E
Glutamine      Gln  Q     Glycine        Gly  G
Histidine      His  H     Isoleucine     Ile  I
Leucine        Leu  L     Lysine         Lys  K
Methionine     Met  M     Phenylalanine  Phe  F
Proline        Pro  P     Serine         Ser  S
Threonine      Thr  T     Tryptophan     Trp  W
Tyrosine       Tyr  Y     Valine         Val  V
Glutamic Acid or Glutamine: Z
Any / Unknown Amino Acid: X

A bit more about Alanine
Molecular structure: CH3-CH(NH2)-COOH
Molecular formula: C3H7NO2
Molecular weight: 89.09
Isoelectric point (pH): 6.00
CAS Registry Number: 56-41-7
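
The molecular weight follows directly from the formula. The sketch below handles only simple formulas without parentheses and uses rounded standard atomic weights for the four elements involved; `molecular_weight` is our own helper.

```python
import re

# Rounded standard atomic weights (g/mol) for the elements in alanine.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def molecular_weight(formula: str) -> float:
    """Sum atomic weights for a flat formula such as 'C3H7NO2'."""
    total = 0.0
    # Each match is an element symbol followed by an optional count.
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[element] * int(count or 1)
    return total


if __name__ == "__main__":
    print(f"{molecular_weight('C3H7NO2'):.2f}")
```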

Protein Structure is Difficult
• There is, presently, no high throughput solution to
determining protein structure
• Crystal structure with X-Ray Crystallography
• MALDI-TOF
• Computational Techniques (not mature beyond secondary
structure)

Dangers of Protein Structures
• If DNA sequences are cartoons…
• Protein structures are even less than that.
– Crystalline form (non biologically active)
– Low temperature
– No interactions with other molecules

Massively parallel biology
• Sequencing:
– Large centers produce multiple megabases per day, run
24 by 7
• Expression:
– Microarrays: 100,000 “spots” in parallel.
– 1 µm diameter
– Read with scanning laser
– Petabytes of image data soon

Why the Explosion?
http://www.sanger.ac.uk/Info/IT/

More…
• Proteomics
• Metabolomics
• Single Nucleotide Polymorphism (SNP)
• …
• Biochemical pathway analysis
• Protein - protein interaction
• …
• “Systems Biology”

Sequence Based
Bioinformatics

Levels of Structure (review)
• Primary: Sequence
• Secondary: Local properties
– Hydrophobic / hydrophilic regions.
– α-helices and β-sheets
• Tertiary: 3-d structure
• Quaternary: Interaction
– Protein-protein interactions
– Post-translational modification
– Enzymatic action

Homology is evolutionary relation
• Homolog:
– Related by descent.
– This is a boolean property: it is either true or false
• Can Occur Via:
– Duplication within a genome
– Separation by descent.

Other Terms
• Synteny:
– Genes share ordering between species
• Ortholog: Related by speciation
• Paralog: Related by duplication
• Wet lab: Bubbling vats of goo
• Dry lab: Whirring fans

Comparative Genomics

Phylogenetic Reconstruction

Chromosome scale
rearrangements
Remarkable similarity between
mouse and human chromosomes.
But what does this picture mean?
And how would we go about
computing it?
•Traditional gene maps?
•Markers?
•Sequence similarity?
•A combination of the wet and dry
lab?

Genetic Database Collaboration
• NCBI
– National Center for Biotechnology Information
– GenBank
– http://www.ncbi.nlm.nih.gov
• EBI
– European Bioinformatics Institute
– EMBL - European Molecular Biology Laboratory
– http://www.ebi.ac.uk
• CIB
– Center for Information Biology
– DDBJ - DNA Data Bank of Japan
– http://www.ddbj.nig.ac.jp

International Collaboration
NCBI: GenBank
EBI: EMBL Nucleotide Sequence Database
CIB: DNA Data Bank of Japan
Data are synchronized nightly between the three centers

National Center for Biotechnology Information
• The genetic sequence database of the US National Institutes of Health
• International Nucleotide Sequence Database Collaboration:
– DNA DataBank of Japan (DDBJ)
– European Molecular Biology Laboratory (EMBL)
– GenBank
• 2 x 10^10 bases in 1.7 x 10^7 sequences
• Release every two months, daily updates
http://www.ncbi.nih.gov

Sequence Data Sets at NCBI
• ‘NT’
• Nucleotide sequence dataset.
• Quality standards include 7x read, 1x reverse
• ‘NR’
• Non-redundant (cough cough…)
• amino acid sequence dataset
• ‘EST’
• Expressed Sequence Tag data
• Low quality, different sort of data

Transitive Catastrophe
• Sequences of low quality are annotated by similarity
to other sequences of low quality
• This can build a corpus of erroneous data
• Which will then be used to generate statistical
models and faster algorithms
• Which will be used to mis-annotate exponentially
increasing volumes of data

More Sequence Data Sets
• Protein Data Bank (PDB):
• Amino acid sequences for which a structure
has been experimentally determined
• SwissProt:
• Amino acid sequences with a high level of
annotation
• Genomes:
• All shapes and sizes

Entrez (at NCBI)
• PubMed: the biomedical literature
• Nucleotide: sequence database (GenBank)
• Protein sequence database
• Structure: three-dimensional macromolecular structures
• Genome: complete genome assemblies
• PopSet: population study data sets
• OMIM: Online Mendelian Inheritance in Man
• Taxonomy: organisms in GenBank
• Books: online books
• ProbeSet: gene expression and microarray datasets
• 3D Domains: domains from Entrez Structure
• UniSTS: markers and mapping data
• SNP: single nucleotide polymorphisms
• CDD: conserved domains

Protein Structure Databases
• PDB - Protein DataBank
– Established in 1971 for protein structures
– http://www.pdb.org
– Now also includes nucleic acids, carbohydrates

Protein Sequence Databases
• PIR - Protein Information Resource
– Protein Sequence Database (PIR-PSD)
– Established in 1984
– http://pir.georgetown.edu/
Year    Amino Acid Residues    Sequence Records
1984    526,466                2,676
2001    76,174,552             219,241

Protein Sequence Databases
• SWISS-PROT
– Established in 1986
– http://www.expasy.org/sprot/
– Try to distinguish themselves by
• Annotation
• Minimal redundancy
• Integration with other databases

More Data Resources
• The Institute for Genomic Research (TIGR)
– http://www.tigr.org
• European Molecular Biology Laboratory (EMBL)
– http://www.embl.org
• European Bioinformatics Institute (EBI)
– http://www.ebi.ac.uk
• SwissProt, TrEMBL:
– http://www.expasy.ch

Ensembl
• EBI’s integrative genome data toolkit.
• A web based tool in which data from various
sources are associated with chromosome maps and
locations.
• http://www.ensembl.org

Distributed Annotation System (DAS)
• Client / Server system for publishing annotations to
chromosomal data.
• http://www.biodas.org
• BioMOBY: Web Services genome annotation
framework

Protein Structures
• SCOP: “Structural Classification of Proteins”
– Superfamily
– Family
– Fold
• CASP: Critical Assessment of Structure Prediction
– Competition for protein structure prediction programs
– Prediction accuracy is still limited (as of 2004).

Data types & Formats

FASTA Format
>gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC class I heavy chain, partial cds, clone MP-5.10m
AGGTATTTCCACACCGCCGTGTCTCGGCCCGGCCTCCGGGAGCCCCTCTTTATCACCGTC
GGCTACGTGGACGACACGCAGTTCGTGCGGTTCGACAGCGACGCCCGGGATCCGAGGAAA
GAACCACGGCAGCCGTGGATGGAGAAGGAGGGGCCGGAGTATTGGGATCGCGAGACTCAA
ATCTCCAAGGAAAACGCACTGAAGTACCGAGAGGCCTTGAACATCCTGCGCGGCTACTAC
AACCAGAGCGAGGCCGGGTCTCACACCTATCAGCGGATGTACGGCTGCGACGTGGGGCCG
GACGGGCGCCTCCTCAGCGGGTTCACGCAGTTCGGCTACGACGGCAGAGATTACATCGCC
CTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACACGGCGGCTCAGATCACCAAGCGC
AAGTGGGAGGCGGCCGGTGAGGCGGAGAGATTCAGGAACTACGTGGAGGGCCGGTGCGTG
GAGTGGCTCCGCAGATACCTG

FASTA Format
• Definition line:
– Required
– Starts with ‘>’
– Contains no line breaks
– Non-printing characters are frowned upon, but don’t break most tools.
Ctrl-A is used by some organizations to combine deflines in Unigene sets.
• Data:
– Nucleotide or amino acid sequence of unlimited length, possibly broken up by
whitespace and carriage returns.
– Capitalization does not matter (unless it does)
• FASTA files can (sometimes) be concatenated.
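
The rules above can be sketched as a small reader. This is a minimal illustration, not the parser used by any particular tool: the function name `read_fasta` and the toy records are made up for the example, and normalizing everything to upper case is an assumption that only holds when capitalization carries no meaning.

```python
from typing import Iterator, List, Tuple


def read_fasta(lines: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (defline, sequence) pairs from the lines of a FASTA file."""
    defline = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines are tolerated
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        elif defline is None:
            raise ValueError("sequence data before the first '>' line")
        else:
            # Drop internal whitespace and normalize case (an assumption).
            chunks.append("".join(line.split()).upper())
    if defline is not None:
        yield defline, "".join(chunks)


# Two concatenated toy records (hypothetical data).
example = """>seq1 first toy record
acgt acgt
ACGT
>seq2 second toy record
ggcc
""".splitlines()

records = dict(read_fasta(example))
print(records)
```

Because each record starts at its own ‘>’ line, concatenated files parse naturally, as long as every file ends with a newline.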

GenBank Entry
LOCUS       AB008577     501 bp    mRNA    linear   MAM 22-JAN-1999
DEFINITION  Bos taurus mRNA for MHC class I heavy chain, partial cds, clone
            MP-5.10m.
ACCESSION   AB008577
VERSION     AB008577.1  GI:4165369
KEYWORDS    MHC class I heavy chain.
SOURCE      Bos taurus (variety:Holstein, isolate:MP-5) cultured T cells
            cDNA to mRNA, clone:MP-5.10m.
  ORGANISM  Bos taurus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
            Euteleostomi; Mammalia; Eutheria; Cetartiodactyla; Ruminantia; Pecora;
            Bovoidea; Bovidae; Bovinae; Bos.
REFERENCE   1  (bases 1 to 501)
  AUTHORS   Urakawa,T., Kodama,M., Morita,M. and Ikeda,H.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-NOV-1997) Toyohiko Urakawa, STAFF Institute, 2nd Division;
            446-1 Ippaizuka, Kamiyokoba, Tsukuba, Ibaraki 305, Japan
            (E-mail:urakawa@gene.staff.or.jp, Tel:+81-298-38-7757, Fax:+81-298-38-7880)

Fun facts about GenBank
• Accession:
– Unique ID for this sequence record: AB008577
• Version:
– Incremented each time the sequence data changes: AB008577.1
• GI (GenInfo Identifier):
– Older numeric NCBI ID, assigned per sequence version: GI:4165369
• Taxonomy ID:
– Link into NCBI’s Taxonomy tree: taxon:9913
• Only original authors can update data
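
These identifiers also appear packed into the FASTA definition lines used throughout this deck. A hedged sketch of pulling them apart, assuming the old-style `gi|<gi>|<db>|<accession.version>|<locus>` layout seen in the examples (the function name `parse_ncbi_defline` is hypothetical, and other defline layouts exist that this does not handle):

```python
def parse_ncbi_defline(defline: str) -> dict:
    """Split a 'gi|<gi>|<db>|<accession.version>|<locus> description' defline."""
    ids, _, description = defline.lstrip(">").partition(" ")
    fields = ids.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError("not a gi|...|db|accession.version defline")
    accession, _, version = fields[3].partition(".")
    return {
        "gi": int(fields[1]),
        "database": fields[2],          # e.g. dbj, gb, emb
        "accession": accession,
        "version": int(version) if version else None,
        "description": description.strip(),
    }


info = parse_ncbi_defline(
    ">gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC class I "
    "heavy chain, partial cds, clone MP-5.10m"
)
print(info["accession"], info["version"], info["gi"])
```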

GenBank Entry
FEATURES             Location/Qualifiers
     source          1..501
                     /organism="Bos taurus"
                     /variety="Holstein"
                     /isolate="MP-5"
                     /db_xref="taxon:9913"
                     /clone="MP-5.10m"
                     /cell_type="cultured T cells"
                     /note="BoLA class I haplotype (A8A14/A6A19);
                     Common E group; RT-PCR amplified clone"
     CDS             <1..>501
                     /standard_name="MHC class I related gene"
                     /note="particial alpha 1and 2 domains"
                     /codon_start=1
                     /product="MHC class I heavy chain"
                     /protein_id="BAA37151.1"
                     /db_xref="GI:4165370"
                     /translation="RYFHTAVSRPGLREPLFITVGYVDDTQFVRFDSDARDPRKEPRQ
                     PWMEKEGPEYWDRETQISKENALKYREALNILRGYYNQSEAGSHTYQRMYGCDVGPDG
                     RLLSGFTQFGYDGRDYIALNEDLRSWTAADTAAQITKRKWEAAGEAERFRNYVEGRCV
                     EWLRRYL"

GenBank Entry
BASE COUNT 105 a 148 c 173 g 75 t
ORIGIN
1 aggtatttcc acaccgccgt gtctcggccc ggcctccggg agcccctctt tatcaccgtc
61 ggctacgtgg acgacacgca gttcgtgcgg ttcgacagcg acgcccggga tccgaggaaa
121 gaaccacggc agccgtggat ggagaaggag gggccggagt attgggatcg cgagactcaa
181 atctccaagg aaaacgcact gaagtaccga gaggccttga acatcctgcg cggctactac
241 aaccagagcg aggccgggtc tcacacctat cagcggatgt acggctgcga cgtggggccg
301 gacgggcgcc tcctcagcgg gttcacgcag ttcggctacg acggcagaga ttacatcgcc
361 ctgaacgagg acctgcgctc ctggaccgcg gcggacacgg cggctcagat caccaagcgc
421 aagtgggagg cggccggtga ggcggagaga ttcaggaact acgtggaggg ccggtgcgtg
481 gagtggctcc gcagatacct g
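
The BASE COUNT line is just a tally of the ORIGIN block. A small sketch of that bookkeeping, shown on the first ORIGIN line only; the helper names are made up for this example.

```python
from collections import Counter


def origin_to_sequence(origin_lines):
    """Strip position numbers and spaces from GenBank ORIGIN lines."""
    return "".join(ch for line in origin_lines for ch in line if ch.isalpha())


def base_count(sequence):
    """Tally each base, case-insensitively."""
    return Counter(sequence.lower())


first_line = [
    "1 aggtatttcc acaccgccgt gtctcggccc ggcctccggg agcccctctt tatcaccgtc",
]
seq = origin_to_sequence(first_line)
counts = base_count(seq)
print(len(seq), dict(counts))
```

Feeding all nine ORIGIN lines through the same two functions should reproduce the BASE COUNT line, which makes a cheap integrity check after any reformatting.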

Ways to access data at NCBI
• http://www.ncbi.nih.gov
– Can use Entrez to define fairly specific sets of sequences and
download them in batch
• ftp.ncbi.nih.gov:/blast/db
– Download the entire 15GB collection of datasets
• http://www.bioperl.org
– Perl routines for automating small data retrieval jobs.

NCBI Supported Formats
• ASCII GenBank Record
• FASTA
• ASN.1
• XML

More file formats
• Chromatogram:
– Binary output of an automated sequencer
• Phd / phred / quality file:
– ASCII file combining bases and quality values.
• ASN.1 (Abstract Syntax Notation One):
– Structured representation of GenBank entries, in text or binary form
– C and C++ libraries for accessing ASN.1 are maintained by NCBI

Sequence Handling Tasks
• Base calling
– Chromatogram -> FASTA
• Sequence Cleaning
– Search for contamination
– Vector
– host DNA
– other common sequencing artifacts.
• Contig Assembly
• Genomic Assembly

Unigene Sets
• Contigging:
– In EST projects, cDNA reads which are believed to originate from the same
mRNA transcript are associated into contiguous segments.
– Sets of these contigged (consensus) sequences are sometimes called
“Unigene Sets.”
– Programs for doing this include:
• phrap
• TIGR Assembler
• Consed
• Arachne

Genomic Assembly
• Genomic Assembly:
– A time and labor intensive process by which gaps in the genomic
sequence are identified, primer pairs are constructed to target those
gaps, and additional sequencing is performed.
– There is no general solution to this, nor will there be.

Microarray Analysis
• Data Management:
– GeneSpring and others: Web front end to an annotation
database for microarray information
• Analysis:
– Normalization
– Synthetic experiment design

Biochemical Pathway Analysis
• Kyoto Encyclopedia of
Genes and Genomes
• http://www.genome.jp/kegg/

Sequence Analysis

Sequence Analysis Overview
• Properties of individual sequences
• Sequence alignment
• Alignment based search (BLAST)
• Multiple Sequence Alignment
• Motifs / etc.
• Statistical models / model based search

Amino Acid Properties

Similar Amino Acids
Tyrosine (Y) Phenylalanine (F)

Similar Amino Acids
Aspartate (D) Glutamate (E)

Examples from EMBOSS
• Pepstats
• Charge
• Compseq
• Pepwindow

Comparing Sequences
>NXCI_115_B04_F 544 0 544 ABI
GTGGTAAAACTGGAGCTCACCGCGGTGGCGGCCGCTCT
ANAACTAGTGGATCCCCCGGGCTGCAGGAATTCGGCAC
GAGATTTTGACAGACATGAGCTCATATGCAGATGCTTT
GCGTGAAGTGTCTGCAGCTCGTGAAGAAGTGCCTGGCC
GACGTGGTTATCCTGGGTACATGTATACTGACTTGGCA
ACGATTTATGAACGGGCAGGACGTATTGAAGGCCGAAA
AGGCTCTATTACTCAGATTCCCATTCTGACCATGCCCA
ATGATGATATTACACACCCAATTCCAGATCTAACAGGT
TACATCACAGAAGGGCAGATATATATTGACAGGCAACT
TCATATCGACAGATATACCCACCAATCAATGTTCTTCC
ATCTCTATCACGATTGATGAAGAGTGCTATAGGGGAGG
GAATGACTCGACGGGATCATGCTGAAGTTTCAAATCAG
CTATAGCAAATTATGCAATTGGAAAGGATGTACAAGCA
ATGAAGGCTGTGGTTGGAGAGGAGGCCTTGTCATCAGA
GGATCTGCTG
>gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC class I heavy chain, partial cds, clone MP-5.10m
AGGTATTTCCACACCGCCGTGTCTCGGCCCGGCCTCCGGGAGCCCCTCTTTATCACCGTC
GGCTACGTGGACGACACGCAGTTCGTGCGGTTCGACAGCGACGCCCGGGATCCGAGGAAA
GAACCACGGCAGCCGTGGATGGAGAAGGAGGGGCCGGAGTATTGGGATCGCGAGACTCAA
ATCTCCAAGGAAAACGCACTGAAGTACCGAGAGGCCTTGAACATCCTGCGCGGCTACTAC
AACCAGAGCGAGGCCGGGTCTCACACCTATCAGCGGATGTACGGCTGCGACGTGGGGCCG
GACGGGCGCCTCCTCAGCGGGTTCACGCAGTTCGGCTACGACGGCAGAGATTACATCGCC
CTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACACGGCGGCTCAGATCACCAAGCGC
AAGTGGGAGGCGGCCGGTGAGGCGGAGAGATTCAGGAACTACGTGGAGGGCCGGTGCGTG
GAGTGGCTCCGCAGATACCTG

Sequence Alignment
HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL
HBB_HUMAN GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
HBA_HUMAN GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
++ ++++H+ KV + +A ++ +L+ L+++H+ K
LGB2_LUPLU NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG
HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
GS+ + G + +D L ++ H+ D+ A +AL D ++AH+
F11G11.2 GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE

Sequence Alignment, Fact 1
“In biomolecular sequences (DNA, RNA, or amino
acid sequences), high sequence similarity
usually implies significant functional or
structural similarity.”
Dan Gusfield
Algorithms on Strings, Trees, and Sequences. 1997. Cambridge
University Press. p.212.

Sequence Alignment, Fact 2
“Evolutionary and functionally related molecular strings can
differ significantly throughout much of the string and yet
preserve the same three-dimensional structure(s), or the
same two dimensional substructure(s) (motifs, domains), or
the same active sites, or the same or related dispersed
residues (DNA or amino acid).”
Dan Gusfield.
Algorithms on Strings, Trees, and Sequences. 1997. Cambridge
University Press. p.334

Sequence Alignment
• Why do sequences appear similar?
– common ancestry
– common function
– chance
• Terms
– Identity - identical matches
– Similarity - common properties
– Homolog - common ancestor (related by descent)
• Paralog - related by gene duplication (often same species, different copy / function)
• Ortholog - related by speciation (different species, often same function)

Doolittle’s Twilight Zone
• The range of sequence similarity in which two
sequences may appear to be related based
only on random chance

Dottup Example

Sequence Alignment
• Aligning two sequences:
– Insert a minimum number of gaps into one or both
sequences to maximize matches
DDLMLSPDDLAQWLTEDPGPSEAPRMSE
|||:| | |: :: ||||| |:|
DDLLL-PQDVEEFF---EGPSEALRVSG
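
A gapped alignment like this one is simply two equal-length strings read column by column. The sketch below marks only exact identities with ‘|’; the ‘:’ marks for similar residues would need a substitution matrix, which is left out here. Function names are made up for the example.

```python
def match_line(a: str, b: str, gap: str = "-") -> str:
    """Return a '|' under every column where both rows hold the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    return "".join("|" if x == y and x != gap else " " for x, y in zip(a, b))


def identity(a: str, b: str, gap: str = "-"):
    """Return (identical columns, columns without a gap)."""
    aligned = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    return sum(1 for x, y in aligned if x == y), len(aligned)


top = "DDLMLSPDDLAQWLTEDPGPSEAPRMSE"
bottom = "DDLLL-PQDVEEFF---EGPSEALRVSG"

print(top)
print(match_line(top, bottom))
print(bottom)
print("identical / aligned columns:", identity(top, bottom))
```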

Sequence Alignments
• Matches may be identical
• Matches may include similar but not identical
properties
|||:| | |: :: ||||| |:|
|||:| | |: :: ||||| |:|

Evolution of String Comparison
• Hamming Distance (1950): The number of positions at which two (binary)
strings of equal length differ.
• Levenshtein Distance (1965): The minimum number of single character insertions,
deletions, or substitutions (edits) required to transform one sequence into
another.
• “Substitution Matrices” (Dayhoff, 1978): Use of a Substitution Matrix to
encode log likelihoods of substitutions.
• “Gapped Alignments” (Many authors, 1980+): Mathematical models for
allowing gaps in alignments
• “Statistical Models” (Many authors, 1982+): No longer aligning against a
specific string, but against the compiled statistics of sets of strings.
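
The Levenshtein step in this progression can be sketched with the classic dynamic programming recurrence. This is a textbook version with unit costs, included only to make the definition concrete; it is not taken from the slides.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # prev[j] holds the distance between the prefix of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

Replacing the unit costs with substitution-matrix scores and gap penalties leads directly to the alignment algorithms covered later in the session.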

Hamming Distance (1950’s)
Count of the differences between two sequences of identical
length
53/55 identical
ctggagctcaccgcggtggcggccgctcta
||||||||||||||||||||||||||||||
49/55 identical
gtaaagcccaccgcggtggcggccgctcta
| ||| ||||||||||||||||||||||
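
A minimal sketch of the Hamming count, applied to the two 30-base strings printed on this slide (the "/55" figures above refer to longer sequences that are not shown in full):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


s1 = "ctggagctcaccgcggtggcggccgctcta"
s2 = "gtaaagcccaccgcggtggcggccgctcta"
print(hamming(s1, s2), "differences over", len(s1), "positions")
```

Because it cannot represent insertions or deletions, a single missing base near the start of one string would make every later position look like a mismatch, which is why the later measures allow gaps.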

Substitution Matrices
• Margaret Dayhoff (1925-1983):
• “Percent Accepted Mutation” (PAM) 1973
• Substitution frequencies from “real” alignments of known
homologs, normalized to some percent mutation rate.
• 1300 sequences, 72 families, closely related within
families
• $\mathrm{PAM}_{ij} = 10 \log_{10} R_{ij}$
• $R_{ij} = \mathrm{freq}(i \to j) / \mathrm{freq}(i)$
• $\mathrm{PAM}\,n = (\mathrm{PAM}\,1)^n$
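
The last two formulas can be illustrated with a toy example. The sketch assumes the power rule applies to the mutation probability matrix (rows summing to 1) and that the log scaling is applied afterwards to the odds ratio; the two-letter matrix `toy_pam1` is invented for the illustration and has nothing to do with Dayhoff's data.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


def log_odds(r):
    """Scaled, rounded score for an odds ratio r, following 10 * log10(R)."""
    return round(10 * math.log10(r))


# Hypothetical 1-step mutation probabilities for a two-letter alphabet.
toy_pam1 = [[0.99, 0.01],
            [0.02, 0.98]]
toy_pam250 = mat_pow(toy_pam1, 250)

print([round(p, 3) for p in toy_pam250[0]])
print(log_odds(2.0), log_odds(0.5))
```

An odds ratio above 1 (substitution seen more often than chance) gives a positive score and a ratio below 1 gives a negative one, which is the sign convention visible in the PAM 250 table.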

PAM 250
A R N D C Q E G H I L K M F P S T W Y V B Z X *
A 2 -2 0 0 -2 0 0 1 -1 -1 -2 -1 -1 -3 1 1 1 -6 -3 0 0 0 0 -8
R -2 6 0 -1 -4 1 -1 -3 2 -2 -3 3 0 -4 0 0 -1 2 -4 -2 -1 0 -1 -8
N 0 0 2 2 -4 1 1 0 2 -2 -3 1 -2 -3 0 1 0 -4 -2 -2 2 1 0 -8
D 0 -1 2 4 -5 2 3 1 1 -2 -4 0 -3 -6 -1 0 0 -7 -4 -2 3 3 -1 -8
C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3 0 -2 -8 0 -2 -4 -5 -3 -8
Q 0 1 1 2 -5 4 2 -1 3 -2 -2 1 -1 -5 0 -1 -1 -5 -4 -2 1 3 -1 -8
E 0 -1 1 3 -5 2 4 0 1 -2 -3 0 -2 -5 -1 0 0 -7 -4 -2 3 3 -1 -8
G 1 -3 0 1 -3 -1 0 5 -2 -3 -4 -2 -3 -5 0 1 0 -7 -5 -1 0 0 -1 -8
H -1 2 2 1 -3 3 1 -2 6 -2 -2 0 -2 -2 0 -1 -1 -3 0 -2 1 2 -1 -8
I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 2 -2 2 1 -2 -1 0 -5 -1 4 -2 -2 -1 -8
L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 -3 4 2 -3 -3 -2 -2 -1 2 -3 -3 -1 -8
K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 0 -5 -1 0 0 -3 -4 -2 1 0 -1 -8
M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 0 -2 -2 -1 -4 -2 2 -2 -2 -1 -8
F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 -5 -3 -3 0 7 -1 -4 -5 -2 -8
P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 1 0 -6 -5 -1 -1 0 -1 -8
S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 1 -2 -3 -1 0 0 0 -8
T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 -5 -3 0 0 -1 0 -8
W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 0 -6 -5 -6 -4 -8
Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 -2 -3 -4 -2 -8
V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 -2 -2 -1 -8
B 0 -1 2 3 -4 1 3 0 1 -2 -3 1 -2 -4 -1 0 0 -5 -3 -2 3 2 -1 -8
Z 0 0 1 3 -5 3 3 0 2 -2 -3 0 -2 -5 0 0 -1 -6 -4 -2 2 3 -1 -8
X 0 -1 0 -1 -3 -1 -1 -1 -1 -1 -1 -1 -1 -2 -1 0 0 -4 -2 -1 -1 -1 -1 -8
* -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 1
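
Using the table is a lookup-and-sum exercise. The sketch below copies only a handful of entries from the matrix above (the dictionary is deliberately incomplete) and scores two hypothetical, ungapped peptides; the helper names are made up.

```python
# A few PAM 250 entries copied from the table; the matrix is symmetric,
# so each pair is stored once under a sorted key.
PAM250_SUBSET = {
    ("D", "E"): 3,
    ("F", "Y"): 7,
    ("L", "M"): 4,
    ("W", "W"): 17,
    ("D", "W"): -7,
}


def pair_score(x: str, y: str) -> int:
    """Look up the substitution score for residues x and y."""
    return PAM250_SUBSET[tuple(sorted((x, y)))]


def ungapped_score(a: str, b: str) -> int:
    """Sum substitution scores over two equal-length, ungapped peptides."""
    if len(a) != len(b):
        raise ValueError("peptides must have equal length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


print(ungapped_score("DYLW", "EFMW"))
```

The chemically similar pairs from the earlier slides, tyrosine/phenylalanine and aspartate/glutamate, score positively even though the residues differ.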

BLOcks Substitution Matrix (BLOSUM)
• Steven and Jorja Henikoff, 1992
• Calculated frequency of substitutions in
conserved motifs, rather than across the
global alignments.

Scoring gapped alignments
• Fixed cost to open a gap
• Smaller per-residue (affine) cost to extend an existing gap.
• Models biological events better than a single fixed cost per gap position
• To score one alignment:
– Sum substitution scores and gap costs.
• To find the best possible alignment:
– Calculate score for all possible alignments
– Pick the best one.
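
Scoring one given alignment is the easy half and can be sketched directly. The parameter values below (match +1, mismatch -1, gap open -5, gap extend -1) are arbitrary assumptions for illustration, as is the convention that the first gap position is charged the opening cost and each further position the extension cost; real tools use a substitution matrix in place of match/mismatch.

```python
def score_alignment(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -5, gap_extend: int = -1) -> int:
    """Sum substitution scores and affine gap costs over an alignment."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    open_in = None  # row holding the gap that is currently open: "a", "b" or None
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("column with gaps in both rows")
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            # Extending an existing gap is cheaper than opening a new one.
            score += gap_extend if open_in == row else gap_open
            open_in = row
        else:
            score += match if x == y else mismatch
            open_in = None
    return score


print(score_alignment("DDLMLSPDDLAQWLTEDPGPSEAPRMSE",
                      "DDLLL-PQDVEEFF---EGPSEALRVSG"))
```

Finding the best alignment is the hard half: enumerating every possible alignment is not feasible for real sequences, which is what the dynamic programming methods on the following slides address.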

Global Alignments
• May miss conserved domains/motifs

Local Alignments
• Good for finding short similar regions
(eg protein domains, motifs)

Optimal Alignments
• Needleman & Wunsch and Smith-Waterman
• Exhaustive Search
• Alignment you get will have the best possible score
• Others may have the same score, but none better
• All pairs of sequences have an optimal alignment, whether or not they
are meaningful
• Slow

Smith, Waterman (1981)
• Finds highest scoring region in common
• Uses a “Dynamic Programming” algorithm
• Compute time grows with the product of the two sequence lengths
(quadratic for sequences of similar length)
• Example: Is ELVIS in the SEVENELEVEN?
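
A compact sketch of the Smith-Waterman recurrence for the ELVIS example. It uses a simple scoring scheme chosen for illustration (match +2, mismatch -1, linear gap cost -1) instead of a substitution matrix and affine gaps, and it returns only the best score and where it ends, without the traceback that recovers the alignment itself.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -1):
    """Return (best local alignment score, (end index in a, end index in b))."""
    best, best_pos = 0, (0, 0)
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, 1):
        curr = [0]
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # The 0 lets an alignment restart anywhere: that is what makes it local.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            if cell > best:
                best, best_pos = cell, (i, j)
        prev = curr
    return best, best_pos


score, end = smith_waterman("ELVIS", "SEVENELEVEN")
print(score, end)
```

The two nested loops fill one cell per pair of positions, which is where the quadratic running time comes from.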

http://bioteam.net
cdwan@bioteam.net
Pairwise Alignment Search
• Needleman & Wunsch (1970):
– Dynamic programming applied to global alignments
• Smith & Waterman (1981):
– Dynamic programming applied to “Local Dayhoff matrix alignments”
• Pearson & Lipman (1988): FASTA
• Altschul et al. (1990): BLAST
– Heuristic approximations to Smith & Waterman allowing “reasonable” performance.
• Altschul et al. (1997): Gapped BLAST
– Further improvements to the BLAST algorithm

Suboptimal Alignments
• Take shortcuts for sake of speed
• FASTA (Global or Local)
– Pearson and Lipman (1988)
• BLAST - Basic Local Alignment Search Tool
– Altschul, Gish, Miller, Myers and Lipman (1990)
– 10-100 times faster than regular Smith-Waterman
– Less accurate
– Today’s gold standard for searching large databases

Why is alignment search complex?
• Perfect String Matching:
– Linear in the length of the strings
• … with gaps:
– Polynomial (~1.5 power) in the length of the strings
• … seeking optimal sub-alignments:
– Polynomial (~2.5 power) in the length of the strings
• … across an exponentially increasing set of
(potentially corrupt) data
– A whole new set of problems.

The real problem?
• The problem is not response time on any single
step.
• The problems are
– Data management
– Throughput and updating results
– Biological Relevance
• We don’t need a faster alignment algorithm, we
need a better homology detector.

Homology Search (ideal)
• Query:
• The thing about which you want information.
• Target:
• Any data at all, preferably all of it at once
• Results:
• Continually updated as new information is published, plus
exhaustive cross references.
• Clear distinction between lab verified and automatic annotations
• “Clickable” is good.

BLAST
• Basic Local Alignment Search Tool
• Focus on local alignments
• important similarities are often confined
to small regions within larger sequences.
• BLAST is a heuristic algorithm:
• Finds exact matches quickly (linear time)
BLAST is the single most popular homology search
program (as of 2004)

BLAST Search
BLAST Finds sequences that are “similar” to a query.
Sequences producing significant alignments: (bits) e-Value
gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC ... 993 0.0
gi|3688210|emb|AJ010861.1|BTAJ10861 Bos taurus MHC class I ... 961 0.0
…
gi|4106072|gb|AF055348.1|AF055348 Diceros bicornis minor cl... 549 e-154
…
gi|18699296|gb|AF464053.1| Sus scrofa MHC class I antigen (... 468 e-129
gi|188474|gb|M84694.1|HUMMHHLAB4 Human MHC class I HLA-B*40... 462 e-127
…

The BLAST Heuristic
BLAST Heuristic:
To be eligible for consideration, a sequence pair must contain an ungapped
high-scoring segment pair (HSP) whose score exceeds some threshold. The best
such pair is the maximal-scoring segment pair (MSP).
Two stage process:
• Find HSPs (linear time)
• Generate alignments, anchored by those HSPs.
caAACTGCTGaacgttgtcgtgagttctggctgcta--
--AACTGCTGggctctc-----ccgatcggctggcaaa
This throws away the vast majority (99% in a random sample) of sequences in the
target set.
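
The first stage can be sketched as an exact word lookup: index every length-k word of one sequence in a hash table, then scan the other sequence once. This is a simplified illustration of the seeding idea only. The word length of 8 and the function name are assumptions for the example, and real BLAST adds scoring thresholds, neighborhood words for proteins, and extension of each hit.

```python
from collections import defaultdict


def word_hits(query: str, subject: str, k: int = 8):
    """Return (query offset, subject offset) pairs sharing an exact k-letter word."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            hits.append((i, j))
    return hits


# The two sequences from the slide, with the display gaps removed.
query = "caAACTGCTGaacgttgtcgtgagttctggctgcta".upper()
subject = "AACTGCTGggctctcccgatcggctggcaaa".upper()

for i, j in word_hits(query, subject):
    print(i, j, query[i:i + 8])
```

Sequences with no shared word never reach the expensive alignment stage, which is how most of the target set gets discarded cheaply.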

BLAST Search Spaces
Program   Query Type    Database Type   Frame Comparisons
blastp    Protein       Protein         1x1
blastn    Nucleotide    Nucleotide      1x1
blastx    Nucleotide*   Protein         6x1
tblastn   Protein       Nucleotide*     1x6
tblastx   Nucleotide*   Nucleotide*     6x6
* Translated in all 3 reading frames on both strands (6 frames)
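
The "6" in the table comes from translating a nucleotide sequence in three reading frames on each strand. A minimal sketch using the standard genetic code; the helper names are made up for this example, and codons containing anything other than A, C, G, T are rendered as X by assumption.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from the start of seq; '*' marks a stop."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq: str):
    """Three forward frames followed by three reverse-strand frames."""
    rc = reverse_complement(seq)
    return [translate(s[offset:]) for s in (seq, rc) for offset in range(3)]


# First 18 bases of the Bos taurus record used throughout these slides.
for frame in six_frames("AGGTATTTCCACACCGCC"):
    print(frame)
```

The first forward frame reproduces the start of the /translation qualifier in the GenBank entry, since that record has codon_start=1.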

BLAST Scores
• “Score”
– $S = S_{\text{substitutions}} - S_{\text{gaps}}$
• “Bit Score”
– Score, normalized for $\lambda$ and $K$, two parameters which should be left alone anyway,
and converted to something looking vaguely information theoretic.
– $S_n = \frac{\lambda S - \ln K}{\ln 2}$
• “E-Value”
– “Expected number of hits of this score, in a target set of size n, with a query of
length m”
– $E = m n \, 2^{-S_n}$
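
The two formulas translate directly into code. In the sketch below the values of lambda, K, the raw score, and the search space sizes are illustrative assumptions, not numbers from the slides; in practice BLAST reports lambda and K for the scoring system in use.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized score in bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, m: int, n: int) -> float:
    """Expected number of chance hits with at least this bit score."""
    return m * n * 2.0 ** (-bits)


# Illustrative values only (assumptions for the example).
lam, k = 0.267, 0.041
raw = 120
m, n = 500, 10_000_000  # query length, total target length

bits = bit_score(raw, lam, k)
print(round(bits, 1), e_value(bits, m, n))
```

Substituting the first formula into the second gives the equivalent form E = K m n e^(-lambda S), which shows why E grows linearly with the size of the target set.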

BLAST Search
• Bits: Large scores are good
• E-value: Small scores are good
Sequences producing significant alignments: (bits) e-Value
…
gi|4106072|gb|AF055348.1|AF055348 Diceros bicornis minor cl... 549 e-154
…
gi|18699296|gb|AF464053.1| Sus scrofa MHC class I antigen (... 468 e-129
gi|188474|gb|M84694.1|HUMMHHLAB4 Human MHC class I HLA-B*40... 462 e-127
…

E-Value
2.71828182845904523536028747
• Unstable:
– Changes every time the dataset grows.
• E-Values are not probabilities
– Yet people seem to treat them as though they are
• Rules of thumb:
– $10^{-30}$: A good, solid hit. Take it to the lab and verify it.
– $10^{-10}$: Okay. Base some further literature search on this.
– 1: Threshold of random chance
– 10: BLAST default cutoff

“Low Complexity Regions”
• By default, BLAST filters out
regions of “Low Complexity” and
replaces them with “XXXXX” in
the alignments.
• This may or may not be what you
want.

Potential Problems
• Round off errors
• Can fail the ‘diff’ test between 32 and 64 bit architectures
• “Silent” errors
• Check those logfiles.
• Parsing
• Please do not write another BLAST output parser.
• There are too many of them already in the world.
• Seriously.
• I’m not kidding about this one.
• Shadowing:
• Omission of interesting short hits in favor of less interesting but longer hits.

BLAST Implementations
• NCBI BLAST
• NCBI Web Site
• NCBI command line tools
• Washington University BLAST
• (web based & command line)
• TIGR online searches
• Los Alamos National Lab
– mpiBLAST
• TimeLogic Corporation
• “Tera-BLAST”
• Everyone else in the world…

Is There A Parallel BLAST?
Yes.

Multiple Sequence Alignment
• Given a family of related sequences, construct an
optimal multiple sequence alignment (MSA).
• Based on that MSA, construct models which can be
used to recognize as yet unrecognized members of
the set.

Multiple Sequence Alignments
• Patterns
• Motifs
• Position Specific Scoring Matrices
• Hidden Markov Models
• Neural Networks

Danger Points
• No longer computing similarity to any single
observed sequence (what would they test in the
lab?)
• “Transitive Catastrophe”
• Statistical Starvation.

Beware Intellectual Inbreeding
• Using known protein families, we compute costs for amino acid
substitutions.
• Using those costs, we search for potential homologies and new (putative)
families.
• Build statistical models based on putative protein families
• Rediscover known families with statistical techniques
• Does this provide independent confirmation?

Example: ClustalW
• Align each sequence to each other sequence
• Select a seed alignment
• Build up a multiple alignment from the pieces
• Works great for close relatives

Conserved Patterns
• Motifs:
– Conserved substrings in multiple alignments / sets of
sequences
• Position Specific Scoring Matrices.
– Add “at each position in an alignment” to the work of
Dayhoff.
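
A position specific scoring matrix can be sketched as one log-odds column per alignment position. The toy DNA alignment, the uniform background frequencies, and the pseudocount of 1 are all assumptions made for this illustration; gaps are not handled.

```python
import math


def build_pssm(aligned, alphabet: str = "ACGT", pseudocount: float = 1.0):
    """Return one {letter: log2 odds} dict per column of an ungapped alignment."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(alphabet)  # uniform background (assumption)
    pssm = []
    for col in range(length):
        column = [s[col] for s in aligned]
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            letter: math.log2(((column.count(letter) + pseudocount) / total)
                              / background)
            for letter in alphabet
        })
    return pssm


def score(pssm, word: str) -> float:
    """Sum the per-position scores of a word as long as the matrix."""
    return sum(column[ch] for column, ch in zip(pssm, word))


toy_alignment = ["ACGT", "ACGA", "ACGT", "TCGT"]  # hypothetical motif instances
pssm = build_pssm(toy_alignment)
print(round(score(pssm, "ACGT"), 2), round(score(pssm, "TTTT"), 2))
```

Unlike a single substitution matrix, the same letter can score well at one position and badly at another, which is the "at each position" refinement described above.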

What is HMMer?
• Written by Sean Eddy at Wash U
• Open Source
• 15 separate executables
• Build a statistical model of a multiple sequence alignment
• Search sequence databases with models
• Search model databases with sequences

Search Shrimp for globin
• Build an HMM from 50 globins
% hmmbuild globin.hmm globins50.msf
• Calibrate the model
% hmmcalibrate globin.hmm
• Search shrimp sequence database with model
% hmmsearch globin.hmm Artemia.fa
• Search model database with shrimp sequences
% hmmpfam globin.hmm Artemia.fa

MSF Format…
DNA_MULTIPLE_ALIGNMENT 1.0
Three anthropoidea
MSF: 50 Type: N Check: 2666 ..
Name: Homo_sapiens Len: 50 Check: 8318 Weight: 1.00
Name: Pan_paniscus Len: 50 Check: 7854 Weight: 1.00
Name: Gorilla_gorilla Len: 50 Check: 7778 Weight: 1.00
//
Homo_sapiens AGUCGAGUC...GCAGAAAC
Pan_paniscus AGUCGCGUCG..GCAGAAAC
Gorilla_gorilla AGUCGCGUCG..GCAGAUAC
Homo_sapiens GCAUGAC.GACCACAUUUU.
Pan_paniscus GCAUGACGGACCACAUCAU.
Gorilla_gorilla GCAUCACGGAC.ACAUCAUC
Homo_sapiens CCUUGCAAAG
Pan_paniscus CCUUGCAAAG
Gorilla_gorilla CCUCGCAGAG

hmm State Diagram

hmm Format
HMMER2.0 [2.2g]
NAME globins50
LENG 148
ALPH Amino
RF no
CS no
MAP yes
COM ../binaries/hmmbuild globin.hmm globins50.msf
COM ../binaries/hmmcalibrate globin.hmm
NSEQ 50
DATE Thu Jul 25 10:51:38 2002
CKSUM 9858
XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4
NULT -4 -8455
NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644
EVD -41.853970 0.212647
HMM A C D E F G H I K L M N P Q R S T V W Y
m->m m->i m->d i->m i->i d->m d->d b->m m->e
-661 * -1444
1 77 -228 -1302 -1020 -730 -1034 -756 578 -803 -375 82 -791 -1461 -720 -959 364 -94 2204 -1315 -857 9
- -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
- -39 -5807 -6849 -894 -1115 -701 -1378 -661 *

One would expect wet-lab scientists to have a healthy skepticism of any
results, knowing how often experiments fail, and how much bad data has
made it out into the literature, but many seem to have an almost mystical
faith in anything produced by computation.
On the other hand, computational people seem to have an almost mystical
faith in wet-lab verification---expecting experiments to be neat, quick
deterministic tests like "if" statements in code.
- Gordon D. Pusch

What can I do today?
• CS:
– Take biology coursework
– Accept that biology is really, really
complex and difficult.
• Bio:
– Take CS coursework
– Accept that computer engineering /
software development is tricky.
• Administrators:
– Decide to build a “spire, which will be
visible from afar”
• All:
– Attend Journal Clubs, symposia, etc.
– Get a bigger monitor

Thank you
