Intro bioinformatics

A short course I taught in 2004 about bioinformatics, focused on making it useful for CS / HPC people. Terribly dated at this point.

Published in: Science

Intro bioinformatics

  1. 1. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Genomic Biology and Bioinformatics The BioTeam
  2. 2. BioTeam™ Inc. • Objective & vendor-neutral informatics and ‘bio-IT’ consulting • Composed of scientists who learned to bridge the gap between life science informatics and high performance IT • “iNquiry” bioinformatics cluster solution • Staff: Michael Athanas, Bill Van Etten, Chris Dagdigian, Stan Gloss, Chris Dwan • http://bioteam.net
  3. 3. Goal of this session • Introduce major concepts in genetics, genomics, and bioinformatics. • Provide a minimal vocabulary to enable communication. • Enable communication between the disciplines. Please ask questions.
  4. 4. Outline • Genetics to Genomics • Data formats & Resources • Sequence Analysis
  5. 5. Goals • Build shared vocabulary, global view • Introduce online and text resources • Build interest Not: • Teaching molecular biology • Teaching bioinformatics
  6. 6. Motivation for this session • Bioinformatics will be the major new application domain for High Performance Computing (HPC) applications over the next 50 years. • Life Scientists will walk into the computing center, wanting to work with you (or you will walk into their lab…) • No need to repeat old mistakes.
  7. 7. What is Bioinformatics? • http://bioinformatics.org/faq/#definitions – Computational Biology – Systems Biology – Genetics – Biology – *-omics • The application of high performance computing and data handling techniques to life sciences research • A major revenue stream, with lots of hype
  9. 9. Genome Sizes (in base pairs) • HIV type 1 (HIV): 9,750 • Escherichia coli (E. coli): 4×10^6 • Saccharomyces cerevisiae (yeast): 10^7 • Oryza sativa (rice): 10^8 • Arabidopsis thaliana (“mouse-ear cress”): 10^8 • Drosophila melanogaster (fruit fly): 1.8×10^8 • Bos taurus (cow): 3×10^9 • Homo sapiens (human): 3×10^9 • Zea mays (corn): 5×10^9 • Pinus resinosa (pine): 7×10^10 • Amoeba dubia (amoeba): 6.7×10^11
  10. 10. In context (Jan, 2004) • Complete genomes: ~800 • 19 eukaryotic • 16 archaea • 64 bacteria • The rest: viruses • Eukaryotes with at least one sequence in GenBank: • Between 50,000 and 100,000 • Distinct species • 1.4×10^6 uniquely named species • ~10^7 distinct species
  11. 11. Genome Sizes (in base pairs)
  12. 12. What else could be bioinformatics? • Fold / Structure / Docking / Function predictions on proteins and bioactive molecules • Ontology building / literature searches / text mining / knowledge management • Image processing to support lab automation / data capture / experiment steering • Medical records integration with proteomic / transcript studies • Expert systems / AI / Clinical / Lab assistant • Virtual organizations, distributed databases, ad hoc expert conversations…
  14. 14. Suffixes • “ology”: • Biology, Physiology, Embryology, Terminology • Homology? Homo = same; logy = origin • “ics”: • Physics, Linguistics, Statistics, Bioinformatics • “ome”: • Proteome, Genome, Transcriptome, • Chromosome? Chromo = color; soma = body; • “ome-ics”: • Proteomics, genomics • Economics?
  15. 15. Topics in Genomics • The Central Dogma • Levels of structure and interaction • The Chromosome Model • DNA Sequencing • Genome Assembly • Transcripts and Expression Levels • Protein Folding • Protein Interaction
  16. 16. What I want you to remember • Genotype vs. Phenotype • The Chromosome Model • The Central Dogma • Levels of Structure (primary -> quaternary) • Homology is boolean • It’s more complicated than they will admit (at first) • http://www.bioinformatics.org • http://www.ncbi.nih.gov Bioinformatics is Biology
  17. 17. Real Question (July 15, 2002) “We have 10,000 BAC end reads from an organism with massive synteny to a model organism. We want to map markers from the model onto the putative homologs in the BAC clones so that we can do directed sequencing.”
  18. 18. Example Question We have 10,000 BAC end reads from an organism with massive synteny to a model organism. We want to map markers in the model onto the putative homologs in the BAC clones so that we can do directed sequencing. • What is a BAC end read? How does it differ from a BAC clone? • What is a Homolog? Given that, what is a “putative” one? • What is “Synteny?” Is it different from homology? • What is a model organism? • What are “markers?” How can I best help this person?
  19. 19. Real Question (May 30, 2002) “Tell me all the kinases which have a valine or an arginine within 2 angstroms of the active site.” • What is a kinase? • What are valine and arginine? • What is an active site?
  20. 20. Why Put The Biology First? “Bioinformatics is full of pitfalls for those who look for patterns or make predictions without a thorough understanding of where biological data comes from and what it means” Nevin Young PhD Professor, University of Minnesota
  21. 21. A New Way of Thinking • “The new paradigm, now emerging, is that all genes will be known (in the sense of being resident in databases available electronically), and that the starting point of a biological investigation will be theoretical.” - Walter Gilbert, 1993, speculating on the nature of biology in the “post-genome era”
  22. 22. Genetics to Genomics • 1600’s: Europe emerges from the dark ages • 1822 - 1884: Gregor Mendel • 1920’s: Genetic Mapping (Morgan) • 1952: DNA is Genetic Material (Hershey) • 1953: DNA Helix (Watson & Crick, Franklin) • 1966: Genetic Code (Nirenberg, Khorana) • 1977: DNA Sequenced (Sanger) • 1988: Human Genome Project Started • 2001: Human Genome Draft Finished
  23. 23. Selective Breeding
  24. 24. Francesco Redi: 1626-1697 • Prevailing Theory “Spontaneous Generation” – Meat makes maggots – Straw makes mice • Experiment: – Meat in two jars, one open one sealed. – Observe flies -> eggs -> maggots -> flies – nothing happens to the closed jar meat • Inference: Flies make flies. • Confirmed by Pasteur in mid 1800’s
  25. 25. Science Marches On! • 1651 - William Harvey • Theory: “Ex ovo omnia” (from the egg, everything!) (No evidence whatsoever) • 1827 - Karl Ernst von Baer • First mammalian egg observed under a microscope. (dog) • 1868 - Friedrich Miescher • DNA (“Nuclein”) first observed. (Surgical bandages from soldiers) • 1875 - Oscar Hertwig • Observed that fertilization in both animals and plants consists of the physical union of the two nuclei contributed by the male and female parents. (Sea Urchin) • 1882 - Walther Flemming • Observed chromosomes by staining cells at mitosis (Salamander)
  26. 26. Gregor Mendel (1822-1884) • Monk, Interested in math & gardening • Selectively bred pea plants – 28,000 plants over 7 years – 7 distinct phenotypic traits. • Published: 1866 • First Cited: 1900
  27. 27. Why did Mendel succeed? • Studied one characteristic at a time: – Pea shape – Internal color – Seed-coat and flower color – pod shape – pod color – flower position – plant height • Kept pedigrees and made several generations of crosses • Kept track of numbers of progeny from each cross. Mendel was really, really lucky.
  28. 28. Genotype vs. Phenotype • Genotype: – Properties (not necessarily observable) that can be passed on to offspring – DNA code and other genetic properties • Phenotype: – Observable traits of the organism – Things we can see Farmers have known this for a long time
  29. 29. Mendelian Genetics • Genetic “factors” (genes) determine phenotypic traits. • Each organism has two instances (alleles) of each gene. • Independent assortment: One copy from each parent (selected at random) is passed on to each progeny.
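
As a rough illustration of the rule on this slide, the sketch below enumerates the offspring of a single-gene cross. The function name `cross` and the two-letter genotype strings such as "Aa" are assumptions made for this example, not anything from the original course.

```python
from collections import Counter
from itertools import product


def cross(parent1: str, parent2: str) -> Counter:
    """Count offspring genotypes for a single-gene cross.

    Each parent is a two-letter string of alleles (hypothetical notation,
    e.g. "Aa"). Each parent passes on one allele chosen at random, so all
    four pairings are equally likely.
    """
    offspring = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # Sort so that "aA" and "Aa" are counted as the same genotype.
        offspring["".join(sorted((allele1, allele2)))] += 1
    return offspring


# Crossing two heterozygous parents gives the familiar 1 : 2 : 1 ratio.
print(cross("Aa", "Aa"))
```

With a dominant "A" allele, the 1 : 2 : 1 genotype ratio corresponds to a 3 : 1 ratio of observable phenotypes.
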
  30. 30. Cell Division • Mitosis: • “Ordinary” cell division • Start with 1 diploid cell • End with 2 diploid cells • No crossing over (or, if so, it doesn’t matter) • Meiosis: • “Gametogenesis” • Start with 1 diploid cell • End with 4 haploid gamete cells • Crossing over occurs (mechanism for independent assortment)
  31. 31. How was Mendel Lucky? Mendel was lucky because: • Peas are diploid • The traits he studied were all far apart on the chromosomes • He didn’t use a self fertilizing (or otherwise freakish) plant Mendel was unlucky because: • Despite being mostly correct, his paper was rejected by his journal of choice • He died before anyone discovered and cited his results • People now think that he must have cleaned his data.
  32. 32. “The Chromosome Model” With this model, we can look at the entire range of molecular biology, from chromosomes to base pairs. This is not a mechanism
  33. 33. Chromosomes • Chromo = color • Soma = body • Chromosomes: – Colored (when stained) bodies that appear in the cell at mitosis and meiosis – Appear in pairs, except in gamete cells (sperm and ova), where they are single. – A good candidate for the location of genes
  34. 34. Science Marches On! • 1902: Walter Sutton – Evidence that Mendel’s genetic factors exist on chromosomes (grasshoppers) Metaphase Spread Karyotype
  35. 35. Number of (different) Chromosomes • Chimpanzee: 48 • Cabbage: 18 • Camel: 70 • Chicken: 78 • Cat: 34 • Dog: 78 • Human: 46 • Corn: 20 • Alligator: 32
  36. 36. Chromosome Copies: “Ploidy” “Number of copies of each chromosome” • 2 = Diploid: – Humans (and the majority of other eukaryotes) • 4 = Tetraploid: – Pine Trees • 6 = Hexaploid: – ?? • 8 = Octoploid: – Starfish
  37. 37. Thomas Morgan (1866-1945) • “The Fly Room” – Breeding experiments on Drosophila melanogaster (Columbia University) • Alfred Sturtevant: – First Chromosome Map • Calvin Bridges: – Chromosome theory of Heredity • Hermann Muller: – Mutations can be induced by X-ray irradiation
  38. 38. Why Model Organisms? • Fruitflies: • Only eight chromosomes. • Reproduce very quickly, with lots of offspring. • Tiny, so they don't take up a lot of room in the lab. • They don't need a whole lot of food to survive. • More Recently: • Small genome • Easily transformed • Numerous mutants • Well funded research community
  39. 39. Some modern models • Drosophila melanogaster • Mus musculus • Anopheles gambiae • Arabidopsis thaliana • Medicago truncatula • Oryza sativa • Glycine max • Zea mays
  40. 40. Prokaryotes vs. Eukaryotes • Viruses: (10^2 genes, 10^4 base pairs) • Prokaryote: (10^3 genes, 10^6 base pairs) • No Nucleus (Mostly bacteria) • No Introns (genes read continuously) • One circular chromosome • Genes clumped together in “operons” • Much simpler genetics. Also much harder to see. • Eukaryote: (10^4 genes, 10^9 base pairs) • Nucleated • Introns (Genes have untranslated “stuff” stuck in them) • Many, linear chromosomes • Genes spread out all over the place • Multi-cellular and therefore more interesting.
  41. 41. Chromosome Mapping y^3 - 12y^2 + 2y + 4 = 0 Alfred Sturtevant was an undergraduate working in Morgan’s lab who (the story goes) set aside his algebra homework one night to create the first genetic map.
  42. 42. Crossing Over
  43. 43. Chromosome Mapping • Linked Genes: – Recombine less frequently than expected by Mendel’s law of independent assortment – Frequency of recombination ∝ distance – Sturtevant called the unit of distance “map units” – Frequently referred to as “centiMorgans” after Dr. Morgan
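
The relationship between recombination frequency and map distance can be sketched in a few lines. This assumes the simple convention that 1% recombinant offspring corresponds to 1 map unit (centiMorgan), which only holds approximately and only for genes that are close together. The function name and the counts are hypothetical.

```python
def map_distance_cm(recombinant_offspring: int, total_offspring: int) -> float:
    """Estimate genetic distance in centiMorgans from a two-gene cross.

    Assumes 1% recombination == 1 cM, a reasonable approximation only for
    short distances where double crossovers are rare.
    """
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return 100.0 * recombinant_offspring / total_offspring


# Hypothetical cross: 17 recombinant flies out of 100 scored.
print(map_distance_cm(17, 100))  # 17.0
```
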
  44. 44. Crossing Over
  45. 45. A Genetic Map of Drosophila Note that we’re still not looking at DNA sequences.
  46. 46. DNA is the Genetic Material 1943: Oswald Avery et al. sacrifice mice to demonstrate that DNA could be the material for genes (to one part in 6×10^8). 1952: Alfred Hershey and Martha Chase use viruses to prove it. “Perhaps we will be able to grind genes in a mortar and cook them in a beaker after all.” -Hermann Muller
  47. 47. “At the time it was believed that DNA was a stupid substance. A tetranucleotide which couldn’t do anything specific.” -Max Delbrück
  48. 48. Nobel Milestones • 1953 - 3D Structure of DNA – Watson & Crick - model – Wilkins & Franklin - x-ray structure – Nobel in 1962
  49. 49. 1953: Watson & Crick Structure • Nucleotides – ‘A’ Adenine – ‘G’ Guanine – ‘C’ Cytosine – ‘T’ Thymine “It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.” Watson & Crick, 1953
  50. 50. Deoxyribonucleic Acid Chromosomes are long chains of nucleotides in complementary strands… ...AAACTGGAGCTCACCGCGGTGGCGGC... ...TTTGACCTCGAGTGGCGCCACCGCCG... Complementary single strands have strong affinity for each other: A pairs with T, G pairs with C.
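
For readers coming from computing, strand complementarity is just a character mapping. The sketch below uses standard Watson-Crick pairing (A with T, G with C). The function names are chosen for this example.

```python
# Standard Watson-Crick pairing: A <-> T, G <-> C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Return the base-by-base complement of a DNA string."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return complement(seq)[::-1]


print(complement("AAACTGGAGCTCACCGCGGTGGCGGC"))
print(reverse_complement("GATC"))
```

Because the two strands run in opposite directions, most tools report the reverse complement rather than the plain complement.
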
  51. 51. “The Chromosome Model” With this model, we can look at the entire range of molecular biology, from chromosomes to base pairs.
  52. 52. Nobel Milestones • 1959 – 3D Structure of a Protein – Perutz & Kendrew – structure of myoglobin & hemoglobin – Nobel in 1962
  53. 53. Nobel Milestones • 1970’s – Nucleic Acid Chemistry – Paul Berg – recombinant DNA – Gilbert & Sanger – sequencing – Nobel in 1980
  54. 54. Sequencing • First protein sequence (insulin) published by Sanger, 1955; his DNA sequencing method followed in 1977 • Generate all possible subsequences from a fixed 5’ end (primer) • Sort them by weight • Read terminal nucleotide
  55. 55. Sanger Sequencing …AGTCCTG …AGTCCT …AGTCC …AGTC …AGT …AG …A G A T C • DNA of all possible lengths from a known starting point • Each strand ends with a radioactive “dideoxy” nucleotide which terminates the chain • The strands are “weighed” using gel electrophoresis
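
The three steps on the slides (generate every prefix, sort by size, read the last base) can be mimicked directly in code. This is a toy model under simplifying assumptions: it ignores that the synthesized strand is the complement of the template, and the shuffle only stands in for the unordered mix of fragments in the reaction tube. The function name and `seed` parameter are hypothetical.

```python
import random


def sanger_read(sequence: str, seed: int = 0) -> str:
    """Toy model of chain-termination sequencing."""
    # Chain termination yields a fragment ending at every position.
    fragments = [sequence[:n] for n in range(1, len(sequence) + 1)]
    # In the tube the fragments are mixed together in no particular order.
    random.Random(seed).shuffle(fragments)
    # Gel electrophoresis separates the fragments by length.
    fragments.sort(key=len)
    # Reading the labelled terminal base of each band recovers the sequence.
    return "".join(fragment[-1] for fragment in fragments)


print(sanger_read("AGTCCTG"))  # AGTCCTG
```
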
  56. 56. Modern Sequencing • Accomplished in a single capillary tube • Results read via a laser spectrometer • Accurate to ~700bp • Completely automated (~$0.04 / bp in 2003)
  57. 57. Sequence Data, Errors • Error rates for a single read = 0.002 • One error per read sequence, on average • Types of error: • Rare - Misreads • Common - Deletions / double-reads • Insertion of sequence from the vector • Contamination with human or E. coli DNA • Quality tapers off at the end of a read
  58. 58. Nucleotide Ambiguity Codes • A = Adenine • G = Guanine • T = Thymine • C = Cytosine • R = A + G • Y = C + T • K = G + T • M = A + C • S = C + G • W = A + T • V = A + C + G • B = C + G + T • H = A + C + T • D = A + G + T • N = A + G + T + C • I = hypoxanthine !(i/[GATCsn]+/)
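
One practical consequence of ambiguity codes is that a pattern such as "GAYTC" stands for several concrete sequences. A minimal sketch of expanding the codes listed above into a regular expression follows. The names `IUPAC_CODES` and `to_regex` are assumptions for this example, and the code "I" is left out because it is not a choice among A, C, G and T.

```python
import re

# Ambiguity codes from the slide, each mapped to the bases it allows.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "V": "ACG", "B": "CGT", "H": "ACT", "D": "AGT", "N": "ACGT",
}


def to_regex(pattern: str) -> str:
    """Turn a sequence with ambiguity codes into a regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_CODES[code]
        # Single bases stay literal; ambiguous codes become character classes.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


print(to_regex("GAYTC"))  # GA[CT]TC
print(bool(re.fullmatch(to_regex("GAYTC"), "GACTC")))  # True
```
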
  59. 59. “The Chromosome Model” With this model, we can look at the entire range of molecular biology, from chromosomes to base pairs.
  60. 60. Restriction Enzymes • Cut DNA at a specific substring (different for each restriction enzyme) …GGCTAGATTCCCTAGTTCGCTAATCGCT… |||||||||||||||||||||||||||| …CCGATCTAAGGGATCAAGCGATTAGCGA… Cut with “CTAGT” Restriction Enzyme …GGCTAGATTCCCTAGA TCGCTAATCGCT… ||||||||||| |||||||||||| …CCGATCTAAGG GATCTAGCGATTAGCGA… Sticky Ends
  61. 61. Restriction Enzymes • “Cut” DNA only at a substring specific to the restriction enzyme. • Statistically, these substrings will occur several times along the length of a chromosome: Chromosome Cut Sites
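
From a programmer's point of view, locating cut sites is a substring search. The sketch below searches the top strand from the earlier restriction enzyme slide. The function name `cut_sites` is an assumption, and real enzymes also cut at a defined offset within the site, which is not modelled here.

```python
def cut_sites(sequence: str, site: str) -> list[int]:
    """Return the 0-based start position of every occurrence of site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        # Step forward by one so that overlapping occurrences are found too.
        start = sequence.find(site, start + 1)
    return positions


top_strand = "GGCTAGATTCCCTAGTTCGCTAATCGCT"
print(cut_sites(top_strand, "CTAGT"))  # [11]
print(cut_sites(top_strand, "CTAG"))   # [2, 11]
```
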
  62. 62. Vectors • Circular pieces of DNA with a cut site • Used to capture pieces of DNA Insertion site …GGCTAGATTCCCTAGA TCGCTAATCGCT… ||||||||||| |||||||||||| …CCGATCTAAGG GATCTAGCGATTAGCGA… Sticky Ends
  63. 63. Modern vectors Many possible Insertion sites Gene coding for a brightly colored protein so we can visually distinguish vectors with inserts from those without Gene conveying resistance to ampicillin
  64. 64. Making Insert Libraries • Separate out DNA from target organism • Use PCR to make lots of copies of the DNA • Cut with restriction enzymes, with vectors present in solution • Place vectors into E. coli cells • Spread vectorized E. coli onto agar plates • Let grow overnight on medium with ampicillin • Transfer only non-blue colonies into multi well plates (96 or 384). • Sequence all the wells. • What do you get after all this fun? Thousands of “clone libraries” in a freezer somewhere
  65. 65. Sizes of Insert Libraries • Phage Library: • 5 - 3,000bp • Bacterial Artificial Chromosome (BAC): • 80,000 - 100,000 bp • Yeast Artificial Chromosome (YAC): • 150,000 - 200,000 bp
  66. 66. Restriction Enzymes Restriction Fragments • By controlling the relative amounts of DNA and restriction enzyme, we can produce a large set of smaller chromosome fragments
  67. 67. BAC End Sequences Restriction Fragments • It is “easy” to read the 700bp at each end of the insert libraries
  68. 68. How Many Fragments? • For a 5 letter (5-mer) restriction enzyme, odds of randomly hitting the target sequence are approximately: (1/4)^5 = 1/1024 ≈ 10^-3 • If a genome of interest is about 3×10^9 bp this gives us approximately: 3×10^6 segments • Using 3 or 4 unrealistic assumptions….
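
The back-of-the-envelope estimate on this slide is easy to reproduce. The sketch below makes the same unrealistic assumptions: all four bases are equally likely and positions are independent. The function name is hypothetical.

```python
def expected_fragments(genome_bp: int, site_length: int) -> float:
    """Expected number of cut sites in a random genome.

    Assumes each base is A, C, G or T with probability 1/4, independently
    of its neighbours, so a given k-mer appears with probability (1/4)^k
    at any position.
    """
    probability_per_position = 0.25 ** site_length
    return genome_bp * probability_per_position


# A 5-mer site in a 3 x 10^9 bp genome: roughly 3 x 10^6 fragments.
print(expected_fragments(3_000_000_000, 5))  # 2929687.5
```
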
  69. 69. Genome Sequencing: BAC Tiling • Directed BAC Sequencing – Read all BAC Ends & Fingerprints – Create the minimal tiling path to cover each chromosome – Sequence each BAC using smaller insert libraries (but the same basic idea) – Close Gaps (primer walking)
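
Once clones have been placed on a map, choosing a minimal tiling path is an interval-covering problem. The sketch below uses the classic greedy strategy and assumes each clone is already known as a (name, begin, end) interval, which in practice is the hard part. The function name, clone names and coordinates are all hypothetical.

```python
def minimal_tiling_path(clones, start, end):
    """Pick the fewest clones that together cover [start, end).

    clones is a list of (name, begin, stop) tuples with half-open
    coordinates. Greedy rule: among clones that begin at or before the
    position covered so far, take the one that reaches farthest.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])
    path, covered, i = [], start, 0
    while covered < end:
        best = None
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in coverage at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path


# Hypothetical BACs with positions in kilobases along one chromosome.
bacs = [("A", 0, 100), ("B", 50, 180), ("C", 90, 200),
        ("D", 150, 300), ("E", 190, 320)]
print(minimal_tiling_path(bacs, 0, 320))  # ['A', 'C', 'E']
```

A gap raises an error here. In the lab, that is the point where primer walking takes over.
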
  70. 70. Directed BAC Sequencing Minimum Tiling Path
  71. 71. Shotgun Sequencing • Use inserts of approximately 1,000bp • No pre-processing or ordering, use computational techniques to assemble larger and larger fragments • Entirely automated • Works a lot better if someone else is doing BAC sequencing in the public domain
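
To give a feel for "use computational techniques to assemble larger and larger fragments", here is a deliberately naive greedy overlap assembler. It is a teaching sketch, not how production assemblers work: it assumes error-free reads from a single strand and no repeats, and the function names and `min_len` parameter are invented for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more: leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]
    return contigs


print(greedy_assemble(["AGTCCTG", "CCTGAAC", "GAACGT"]))  # ['AGTCCTGAACGT']
```

The all-pairs comparison is quadratic in the number of reads, which hints at why assembly became a high performance computing problem.
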
  72. 72. Finishing a Genome • Sequence ought to be derived from a mixture of anonymous individuals • Hard to finish regions: – Telomere – Centromere – Highly variable regions • 10x coverage, 99% assembly • Standards vary by community.
  73. 73. We have a genome, now what? • Where are the genes? • How are genes controlled / activated? • Can we add to / subtract from the genome? • Why is there all that extra “junk” in there? • What genes are common between organisms?
  74. 74. Topics in Genomics • The Central Dogma • Levels of structure and interaction • The Chromosome Model • DNA Sequencing • Genome Assembly • Transcripts and Gene Expression • Protein Folding • Protein Interaction
  75. 75. Central Dogma DNA • Four bases: G, A, T, C • Double stranded • G pairs with C • A pairs with T • Packaged in chromosomes RNA • T->U • Single stranded • Mechanism for differential gene expression Amino Acid Chains • 20 amino acids • “Genetic Code” translates 3 RNA bases (one codon) to 1 amino acid Transcription Translation All disciplines should have the guts to admit to having a “central dogma”
  76. 76. Levels of Structure • Primary: Sequence • Secondary: Local properties • Hydrophobic / hydrophilic regions. • α-helices and β-sheets • Tertiary: 3-d structure • Quaternary: Interaction • Protein-protein interactions • post-transcriptional modification • Enzymatic action • $$$$
  77. 77. What is a “gene?” • “The fundamental unit of genetic inheritance” • “One gene, one transcript” • One gene, one splice variant • “One gene, one protein” • “One gene, one heritable trait”
  78. 78. Nobel Milestones • 1960’s – Genetic Code • Holley, Khorana and Nirenberg • Rosetta Stone of Life • Nobel in 1968
  79. 79. The Genetic Code
  80. 80. Gamow and the Genetic Code
  81. 81. Transcription & Translation …GATC… …CTAG…DNA …GAUC…mRNA Amino Acid Chain Transcription Translation (in one of six possible “Reading Frames”) …RIDVLKGEKALKASGLVP… Protein Folding Anthrax Toxin Delivery Factor
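
The "six possible reading frames" come from three starting offsets on each of the two strands. A small sketch of six-frame translation follows, assuming the standard genetic code and, for brevity, working on the DNA coding strand directly rather than on mRNA (so T is used where the transcript would have U). The names are chosen for this example.

```python
from itertools import product

# Standard genetic code, listed in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons, starting at position 0."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> list[str]:
    """Translate three offsets on the given strand and three on its reverse complement."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse) for offset in range(3)]


print(translate("ATGGCCTAA"))  # MA*
print(six_frames("ATGGCCTAA"))
```
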
  82. 82. Eukaryotic genes contain Introns
  83. 83. But wait, there’s more “Promoter” TATA “Start” ATG “Stop” TAA mRNA Splicing RNA DNA Introns (non coding regions) are removed AAA(A100+) Poly-A tail is attached Open Reading Frame (ORF) Six reading frames are possible
  84. 84. Expression Level • “What protein is being made / which gene is being turned on when <your question here>?” • Can approximate this with mRNA levels. – Translation does not occur at a fixed rate – Proteins degrade at radically different rates – Some mRNA is never translated
  85. 85. Expressed Sequence Tags 1. Select organism to study 2. Chop up organism into “libraries” representing interesting tissues, developmental stages, or experimental conditions. 3. Extract and sequence as many cDNAs as possible from each library. 4. Compare sequences to determine: • Tissue specific gene expression • Hypothetical functions for proteins • Expression levels (relative concentration in cytoplasm)
  86. 86. Expressed Sequence Tags Cell 1. Use enzymes to digest DNA & proteins, leaving mRNA (poly-A tail: AAAAA(A100+)) 2. Use Reverse Transcriptase (poly-T primer: TTTTT(T)) to create cDNA 3. Capture the cDNA strand in a vector and incorporate it into E. coli cells to replicate. 4. Sequence (via a complex procedure omitted here for the sake of brevity) the cDNA.
  87. 87. EST Data Reads of the same cDNA (product of the same gene) produce an assortment of sequences sharing the Poly-A 3’ end, and extending a random distance toward the 5’ end. Issues: • Sequence contamination with E. coli, or vector • Spurious groupings of cDNA from different genes containing similar regions • Omission of genes due to low concentration or lack of expression (solve with additional libraries)
  88. 88. ESTs are Popular • Human: 4×10^9 sequences • Mouse: 2×10^9 sequences • Medicago truncatula: 1.6×10^6 • Read only the genes which are being expressed • Get crude information about expression levels based on frequency of a certain sequence. • If a genome sequence is available, can locate genes on chromosomes using similarity search
  89. 89. Southern Blot • Affix “target” single stranded sequence to a nylon membrane • Label “probe” single stranded sequences (mRNA from cells) with a fluorescent dye • Wash probe over target • Similar sequences will hybridize (stick together) • Check for fluorescence Target Probe Fluorescent Label
  90. 90. Micro / Macroarrays • Stick (hybridize) single stranded DNA to some surface (glass slide or nylon membrane) • Attach fluorescent markers to the single stranded “probe” control sample • Attach a different frequency of fluorescent marker to experimentally stressed probe sequences • Wash probes over targets. (like will stick to like) • Illuminate with laser and record differential frequency response
  91. 91. Microarray Data
  93. 93. Gene Chips (2003) • 20bp sequences built using photolithography • Sequence must be known in advance • $200-$500 per “chip” from Affymetrix (and others) • Tools for data analysis also available for $$
  94. 94. Microarrays vs Gene Chips • Microarrays • Cheap to create • No need to know sequences ahead of time (just use sample that is already in the freezer) • Gene Chips • Initially expensive to create • All target sequences already known • “The mouse chip.” “The human chip”
  95. 95. Time Course Experiments • At t=0, 5, 10, … from start of condition x • What genes are up and down regulated • What gene clusters seem to move together?
  96. 96. Quality of Microarray data • Spot location • Spot size • Differential Hybridization • Errors in “swishing” of the probes • In general, only differences of 1σ and above are significant.
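
A common way to compare the two fluorescent channels is the log2 ratio of treated to control intensity per spot. The sketch below flags spots whose log ratio lies at least one standard deviation from the mean, on the assumption that the threshold on this slide refers to one standard deviation. The function name, gene names and intensities are made up for illustration.

```python
import math
import statistics


def significant_spots(control, treated, threshold_sd=1.0):
    """Return {gene: log2 ratio} for spots that stand out from the rest.

    control and treated map gene names to positive intensities. A spot is
    flagged when its log2(treated / control) is at least threshold_sd
    standard deviations away from the mean log ratio of all spots.
    """
    log_ratios = {gene: math.log2(treated[gene] / control[gene]) for gene in control}
    mean = statistics.mean(log_ratios.values())
    sd = statistics.stdev(log_ratios.values())
    return {gene: ratio for gene, ratio in log_ratios.items()
            if abs(ratio - mean) >= threshold_sd * sd}


# Hypothetical intensities for five spots.
control = {"g1": 100, "g2": 100, "g3": 100, "g4": 100, "g5": 100}
treated = {"g1": 100, "g2": 110, "g3": 90, "g4": 800, "g5": 95}
print(significant_spots(control, treated))
```

Real analyses also normalise for dye bias and spot quality before applying any threshold.
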
  97. 97. Aspects of Protein Structure 1 XMNFSGKYQV QSQENFEPFM KAMGLPEDLI QKGKDIKGVS EIVHEGKKVK 51 LTITYGSKVI HNEFTLGEEX ELETMTGEKV KAVVKMEGDN KMVTTFKGIK 101 SVTEFNGDTI TNTMTLGDIV YKRVSKRI
  98. 98. Amino Acid Codes • Alanine (Ala, A) • Arginine (Arg, R) • Asparagine (Asn, N) • Aspartic Acid (Asp, D) • Cysteine (Cys, C) • Glutamic Acid (Glu, E) • Glutamine (Gln, Q) • Glycine (Gly, G) • Histidine (His, H) • Isoleucine (Ile, I) • Leucine (Leu, L) • Lysine (Lys, K) • Methionine (Met, M) • Phenylalanine (Phe, F) • Proline (Pro, P) • Serine (Ser, S) • Threonine (Thr, T) • Tryptophan (Trp, W) • Tyrosine (Tyr, Y) • Valine (Val, V) • Any Amino Acid: Z • Unknown Amino Acid: X
  99. 99. A bit more about Alanine • Molecular structure: CH3-CH(NH2)-COOH • Molecular formula: C3H7NO2 • Molecular weight: 89.09 • Isoelectric point (pH): 6.00 • CAS Registry Number: 56-41-7
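
The molecular weight quoted for alanine can be checked from its formula. The sketch below uses rounded standard atomic masses and a very small formula parser that only handles flat formulas such as C3H7NO2 (no parentheses). The names are invented for this example.

```python
import re

# Rounded standard atomic masses in g/mol.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def molecular_weight(formula: str) -> float:
    """Sum atomic masses for a flat formula like 'C3H7NO2'."""
    total = 0.0
    # Each match is an element symbol followed by an optional count.
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[element] * (int(count) if count else 1)
    return total


print(round(molecular_weight("C3H7NO2"), 2))  # 89.09
```
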
  100. 100. Protein Structure is Difficult • There is, presently, no high throughput solution to determining protein structure • Crystal structure with X-Ray Crystallography • MALDI-TOF • Computational Techniques (not mature beyond secondary structure)
  101. 101. Dangers of Protein Structures • If DNA sequences are cartoons… • Protein structures are even less than that. – Crystalline form (non biologically active) – Low temperature – No interactions with other molecules
  102. 102. Massively parallel biology • Sequencing: – Large centers produce multiple megabases per day, run 24 by 7 • Expression: – Microarrays: 100,000 “spots” in parallel. – 1 µm diameter – Read with scanning laser – Petabytes of image data soon
  104. 104. Why the Explosion? http://www.sanger.ac.uk/Info/IT/
  105. 105. More… • Proteomics • Metabolomics • Single Nucleotide Polymorphism (SNP) • … • Biochemical pathway analysis • Protein - protein interaction • … • “Systems Biology”
  106. 106. Sequence Based Bioinformatics
  107. 107. “The Chromosome Model” With this model, we can look at the entire range of molecular biology, from chromosomes to base pairs. This is not a mechanism
  108. 108. Levels of Structure (review) • Primary: Sequence • Secondary: Local properties • Hydrophobic / hydrophilic regions. • α-helices and β-sheets • Tertiary: 3-d structure • Quaternary: Interaction • Protein-protein interactions • post-transcriptional modification • Enzymatic action
  109. 109. Homology is evolutionary relation • Homolog: – Related by descent. – This is a boolean property: it is either true or false. • Can Occur Via: – Duplication within a genome – Separation by descent.
  110. 110. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Other Terms • Synteny: – Genes share ordering between species • Ortholog: Related by speciation • Paralog: Related by duplication • Wet lab: Bubbling vats of goo • Dry lab: Whirring fans
  111. 111. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Comparative Genomics
  112. 112. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Phylogenetic Reconstruction
  113. 113. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Chromosome scale rearrangements Remarkable similarity between mouse and human chromosomes. But what does this picture mean? And how would we go about computing it? • Traditional gene maps? • Markers? • Sequence similarity? • A combination of the wet and dry lab?
  114. 114. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Genetic Database Collaboration • NCBI – National Center for Biotechnology Information – GenBank – http://www.ncbi.nlm.nih.gov • EBI – European Bioinformatics Institute – EMBL - European Molecular Biology Laboratory – http://www.ebi.ac.uk • CIB – Center for Information Biology – DDBJ - DNA Data Bank of Japan – http://www.ddbj.nig.ac.jp
  115. 115. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net International Collaboration • NCBI: GenBank • CIB: DNA Data Bank of Japan (DDBJ) • EBI: EMBL Nucleotide Sequence Database • Data are synchronized nightly between the three centers
  116. 116. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net National Center for Biotechnology Information • The genetic sequence database of the US National Institutes of Health • International Nucleotide Sequence Database Collaboration: – DNA DataBank of Japan (DDBJ) – European Molecular Biology Laboratory (EMBL) – GenBank • $2 \times 10^{10}$ bases in $1.7 \times 10^{7}$ sequences • Release every two months, daily updates http://www.ncbi.nih.gov
  117. 117. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Data Sets at NCBI • ‘NT’ • Nucleotide sequence dataset. • Quality standards include 7x read, 1x reverse • ‘NR’ • Non-redundant (cough cough…) • amino acid sequence dataset • ‘EST’ • Expressed Sequence Tag data • Low quality, different sort of data
  118. 118. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Transitive Catastrophe • Sequences of low quality are annotated by similarity to other sequences of low quality • This can build a corpus of erroneous data • Which will then be used to generate statistical models and faster algorithms • Which will be used to mis-annotate exponentially increasing volumes of data
  119. 119. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net More Sequence Data Sets • Protein Data Bank (PDB): • Amino acid sequences for which a structure has been experimentally determined • SwissProt: • Amino acid sequences with a high level of annotation • Genomes: • All shapes and sizes
  120. 120. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Entrez (at NCBI) • PubMed: The biomedical literature (PubMed) • Nucleotide sequence database (Genbank) • Protein sequence database • Structure: three-dimensional macromolecular structures • Genome: complete genome assemblies • PopSet: population study data sets • OMIM: Online Mendelian Inheritance in Man • Taxonomy: organisms in GenBank • Books: online books • ProbeSet: gene expression and microarray datasets • 3D Domains: domains from Entrez Structure • UniSTS: markers and mapping data • SNP: single nucleotide polymorphisms • CDD: conserved domains
  121. 121. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Protein Structure Databases • PDB - Protein DataBank – Established in 1971 for protein structures – http://www.pdb.org – Now also includes nucleic acids, carbohydrates
  122. 122. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Protein Sequence Databases • PIR - Protein Information Resource – Protein Sequence Database (PIR-PSD) – Established in 1984 – http://pir.georgetown.edu/ Year Amino Acid Residues Sequence Records 1984 526,466 2,676 2001 76,174,552 219,241
  123. 123. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Protein Sequence Databases • SWISS-PROT – Established in 1986 – http://www.expasy.org/sprot/ – Try to distinguish themselves by • Annotation • Minimal redundancy • Integration with other databases
  124. 124. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net More Data Resources • The Institute for Genomic Research (TIGR) – http://www.tigr.org • European Molecular Biology Laboratory (EMBL) – http://www.embl.org • European Bioinformatics Institute (EBI) – http://www.ebi.ac.uk • SwissProt, TrEMBL: – http://www.expasy.ch
  125. 125. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Ensembl • EBI’s integrative genome data toolkit. • A web based tool in which data from various sources are associated with chromosome maps and locations. • http://www.ensembl.org
  126. 126. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Distributed Annotation System (DAS) • Client / Server system for publishing annotations to chromosomal data. • http://www.biodas.org • BioMOBY: Web Services genome annotation framework
  127. 127. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Protein Structures • SCOP: “Structural Classification of Proteins” – Superfamily – Family – Fold • CASP – Competition for protein structure prediction programs – Results are still lacking.
  128. 128. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Data types & Formats
  129. 129. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net FASTA Format >gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC class I heavy chain, partial cds, clone MP-5.10m AGGTATTTCCACACCGCCGTGTCTCGGCCCGGCCTCCGGGAGCCCCTCTTTATC ACGTCGGCTACGTGGACGACACGCAGTTCGTGCGGTTCGACAGCGACGCCCGGG ATCCGAGGAAAGAACCACGGCAGCCGTGGATGGAGAAGGAGGGGCCGGAGTATT GGGATCGCGAGACTCAAATCTCCAAGGAAAACGCACTGAAGTACCGAGAGGCCT TAACATCCTGCGCGGCTACTACAACCAGAGCGAGGCCGGGTCTCACACCTATCA GCGGATGTACGGCTGCGACGTGGGGCCGGACGGGCGCCTCCTCAGCGGGTTCAC GCAGTTCGGCTACGACGGCAGAGATTACATCGCCCTGAACGAGGACCTGCGCTC CGGACCGCGGCGGACACGGCGGCTCAGATCACCAAGCGCAAGTGGGAGGCGGCC GGTGAGGCGGAGAGATTCAGGAACTACGTGGAGGGCCGGTGCGTGGAGTGGCTC CGCAGATACCTG
  130. 130. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net FASTA Format • Definition line: • Required • starts with ‘>’ • contains no line breaks • Non-printing characters are frowned upon, but don’t break most tools. Ctrl-A is used by some organizations to combine deflines in Unigene sets • Data: • Unlimited nucleotide or amino acid sequence, possibly filled with whitespace and carriage returns. • Capitalization does not matter (unless it does) • FASTA files can (sometimes) be concatenated.
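The FASTA rules on this slide can be turned into a small reader. The sketch below is only an illustration, not the parser of any particular toolkit: the function name `parse_fasta` and the toy records are made up for this example, and it assumes that capitalization does not matter and that whitespace inside the sequence can be dropped.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each definition line (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    defline = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no data
        if line.startswith(">"):
            # A new definition line starts a new record.
            defline = line[1:]
            records[defline] = ""
        elif defline is None:
            raise ValueError("sequence data before the first definition line")
        else:
            # Drop internal whitespace and normalise case (an assumption).
            records[defline] += "".join(line.split()).upper()
    return records


# Two concatenated toy records (hypothetical data, not from GenBank).
example = """>seq1 toy record
AGGTATTTCC acaccgccgt
GTCTCGG
>seq2 another toy record
MKV
""".splitlines()

records = parse_fasta(example)
print(records["seq1 toy record"])
```

Keying the dictionary on the full definition line keeps the sketch short; a production reader would normally split out the identifier and cope with duplicate deflines.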
  131. 131. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net GenBank Entry LOCUS AB008577 501 bp mRNA linear MAM 22-JAN-1999 DEFINITION Bos taurus mRNA for MHC class I heavy chain, partial cds, clone MP-5.10m. ACCESSION AB008577 VERSION AB008577.1 GI:4165369 KEYWORDS MHC class I heavy chain. SOURCE Bos taurus (variety:Holstein, isolate:MP-5) cultured T cells cDNA to mRNA, clone:MP-5.10m. ORGANISM Bos taurus Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Cetartiodactyla; Ruminantia; Pecora; Bovoidea; Bovidae; Bovinae; Bos. REFERENCE 1 (bases 1 to 501) AUTHORS Urakawa,T., Kodama,M., Morita,M. and Ikeda,H. TITLE Direct Submission JOURNAL Submitted (02-NOV-1997) Toyohiko Urakawa, STAFF Institute, 2nd Division; 446-1 Ippaizuka, Kamiyokoba, Tsukuba, Ibaraki 305, Japan (E- mail:urakawa@gene.staff.or.jp, Tel:+81-298-38-7757, Fax:+81-298-38-7880)
  132. 132. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Fun facts about GenBank • Accession: • Unique ID for this sequence: AB008577 • Version: • Incremented with each update: AB008577.1 • GI: • Old version of Accession • Taxonomy ID: • Link into NCBI’s Taxonomy tree Only original authors can update data
  133. 133. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net GenBank Entry FEATURES Location/Qualifiers /organism="Bos taurus" /variety="Holstein" /isolate="MP-5" /db_xref="taxon:9913" /clone="MP-5.10m" /cell_type="cultured T cells" /note="BoLA class I haplotype (A8A14/A6A19); Common E group; RT-PCR amplified clone" CDS <1..>501 /standard_name="MHC class I related gene" /note="particial alpha 1and 2 domains" /codon_start=1 /product="MHC class I heavy chain" /protein_id="BAA37151.1" /db_xref="GI:4165370" /translation="RYFHTAVSRPGLREPLFITVGYVDDTQFVRFDSDARDPRKEPRQ PWMEKEGPEYWDRETQISKENALKYREALNILRGYYNQSEAGSHTYQRMYGCDVGPDG RLLSGFTQFGYDGRDYIALNEDLRSWTAADTAAQITKRKWEAAGEAERFRNYVEGRCV EWLRRYL"
  134. 134. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net GenBank Entry BASE COUNT 105 a 148 c 173 g 75 t ORIGIN 1 aggtatttcc acaccgccgt gtctcggccc ggcctccggg agcccctctt tatcaccgtc 61 ggctacgtgg acgacacgca gttcgtgcgg ttcgacagcg acgcccggga tccgaggaaa 121 gaaccacggc agccgtggat ggagaaggag gggccggagt attgggatcg cgagactcaa 181 atctccaagg aaaacgcact gaagtaccga gaggccttga acatcctgcg cggctactac 241 aaccagagcg aggccgggtc tcacacctat cagcggatgt acggctgcga cgtggggccg 301 gacgggcgcc tcctcagcgg gttcacgcag ttcggctacg acggcagaga ttacatcgcc 361 ctgaacgagg acctgcgctc ctggaccgcg gcggacacgg cggctcagat caccaagcgc 421 aagtgggagg cggccggtga ggcggagaga ttcaggaact acgtggaggg ccggtgcgtg 481 gagtggctcc gcagatacct g
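As a rough sketch of how the ORIGIN block relates to the BASE COUNT line, the snippet below strips the coordinates and spaces from the first two ORIGIN lines of this record and tallies the bases. The helper name `sequence_from_origin` is hypothetical, and the code assumes each line starts with a coordinate followed by blocks of bases.

```python
from collections import Counter


def sequence_from_origin(origin_lines):
    """Join the ORIGIN block of a GenBank record into one uppercase string."""
    parts = []
    for line in origin_lines:
        fields = line.split()
        # The first field on each line is the 1-based coordinate; skip it.
        parts.extend(fields[1:])
    return "".join(parts).upper()


# First two ORIGIN lines of the AB008577 entry shown on this slide.
origin = [
    "  1 aggtatttcc acaccgccgt gtctcggccc ggcctccggg agcccctctt tatcaccgtc",
    " 61 ggctacgtgg acgacacgca gttcgtgcgg ttcgacagcg acgcccggga tccgaggaaa",
]

sequence = sequence_from_origin(origin)
base_count = Counter(sequence)
print(len(sequence), dict(base_count))
```

Run over the whole ORIGIN block, the same tally should reproduce the BASE COUNT line, which makes it a cheap consistency check after downloading records.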
  135. 135. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Ways to access data at NCBI • http://www.ncbi.nih.gov • Can use ENTREZ to define fairly unique sets of sequences and download in batch • ftp.ncbi.nih.gov:/blast/db • Download the entire 15GB set of datasets • http://www.bioperl.org • Perl routines for automating small data retrieval jobs.
  136. 136. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net NCBI Supported Formats • ASCII GenBank Record • FASTA • ASN.1 • XML
  137. 137. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net More file formats • Chromatogram: • Binary output of an automated sequencer • Phd / phred / quality file: • ASCII file combining bases and quality values. • ASN.1: • Binary representation of GenBank entries • C and C++ libraries for accessing ASN.1 are maintained by NCBI
  138. 138. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Handling Tasks • Base calling – Chromatogram -> FASTA • Sequence Cleaning – Search for contamination – Vector – host DNA – other common sequencing artifacts. • Contig Assembly • Genomic Assembly
  139. 139. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Unigene Sets • Contigging: – In EST projects, cDNA reads which are believed to originate from the same mRNA transcript are associated into contiguous segments. – Sets of these contigged (consensus) sequences are sometimes called “Unigene Sets.” – Programs for doing this include: • phrap • TIGR Assembler • Consed • Arachne
  140. 140. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Genomic Assembly • Genomic Assembly: – A time and labor intensive process by which gaps in the genomic sequence are identified, primer pairs are constructed to target those gaps, and additional sequencing is performed. – There is no general solution to this, nor will there be.
  141. 141. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Microarray Analysis • Data Management: – GeneSpring and others: Web front end to an annotation database for microarray information • Analysis: – Normalization – Synthetic experiment design
  142. 142. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Biochemical Pathway Analysis • Kyoto Encyclopedia of Genes and Genomes • http://www.genome.jp/kegg/
  143. 143. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Analysis
  144. 144. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Analysis Overview • Properties of individual sequences • Sequence alignment • Alignment based search (BLAST) • Multiple Sequence Alignment • Motifs / etc. • Statistical models / model based search
  145. 145. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Amino Acid Properties
  146. 146. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Similar Amino Acids Tyrosine (Y) Phenylalanine (F)
  147. 147. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Similar Amino Acids Aspartate (D) Glutamate (E)
  148. 148. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Examples from EMBOSS • Pepstats • Charge • Compseq • Pepwindow
  149. 149. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Comparing Sequences >NXCI_115_B04_F 544 0 544 ABI GTGGTAAAACTGGAGCTCACCGCGGTGGCGGCCGCTCT ANAACTAGTGGATCCCCCGGGCTGCAGGAATTCGGCAC GAGATTTTGACAGACATGAGCTCATATGCAGATGCTTT GCGTGAAGTGTCTGCAGCTCGTGAAGAAGTGCCTGGCC GACGTGGTTATCCTGGGTACATGTATACTGACTTGGCA ACGATTTATGAACGGGCAGGACGTATTGAAGGCCGAAA AGGCTCTATTACTCAGATTCCCATTCTGACCATGCCCA ATGATGATATTACACACCCAATTCCAGATCTAACAGGT TACATCACAGAAGGGCAGATATATATTGACAGGCAACT TCATATCGACAGATATACCCACCAATCAATGTTCTTCC ATCTCTATCACGATTGATGAAGAGTGCTATAGGGGAGG GAATGACTCGACGGGATCATGCTGAAGTTTCAAATCAG CTATAGCAAATTATGCAATTGGAAAGGATGTACAAGCA ATGAAGGCTGTGGTTGGAGAGGAGGCCTTGTCATCAGA GGATCTGCTG >gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC class I heavy chain, partial cds, clone MP-5.10m AGGTATTTCCACACCGCCGTGTCTCGGCCCGGCCTCCGGGAG CCCCTCTTTATCACGTCGGCTACGTGGACGACACGCAGTTCG TGCGGTTCGACAGCGACGCCCGGGATCCGAGGAAAGAACCAC GGCAGCCGTGGATGGAGAAGGAGGGGCCGGAGTATTGGGATC GCGAGACTCAAATCTCCAAGGAAAACGCACTGAAGTACCGAG AGGCCTTAACATCCTGCGCGGCTACTACAACCAGAGCGAGGC CGGGTCTCACACCTATCAGCGGATGTACGGCTGCGACGTGGG GCCGGACGGGCGCCTCCTCAGCGGGTTCACGCAGTTCGGCTA CGACGGCAGAGATTACATCGCCCTGAACGAGGACCTGCGCTC CGGACCGCGGCGGACACGGCGGCTCAGATCACCAAGCGCAAG TGGAGGCGGCCGGTGAGGCGGAGAGATTCAGGAACTACGTGG AGGGCCGGTGCGTGGAGTGGCTCCGCAGATACCTG
  150. 150. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Alignment HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL HBB_HUMAN GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL HBA_HUMAN GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL ++ ++++H+ KV + +A ++ +L+ L+++H+ K LGB2_LUPLU NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL GS+ + G + +D L ++ H+ D+ A +AL D ++AH+ F11G11.2 GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE
  151. 151. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Alignment, Fact 1 “In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity.” Dan Gusfield. Algorithms on Strings, Trees, and Sequences. 1997. Cambridge University Press. p. 212.
  152. 152. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Alignment, Fact 2 “Evolutionarily and functionally related molecular strings can differ significantly throughout much of the string and yet preserve the same three-dimensional structure(s), or the same two-dimensional substructure(s) (motifs, domains), or the same active sites, or the same or related dispersed residues (DNA or amino acid).” Dan Gusfield. Algorithms on Strings, Trees, and Sequences. 1997. Cambridge University Press. p. 334.
  153. 153. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Alignment • Why do sequences appear similar? – common ancestry – common function – chance • Terms – Identity - identical matches – Similarity - common properties – Homolog - common ancestor (related by descent) • Paralog - same species, different copy / function • Ortholog - same function, different species
  154. 154. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Doolittle’s Twilight Zone • Point at which two sequences may appear to be related based only on random chance
  155. 155. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Dottup Example
  156. 156. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Alignment • Aligning two sequences: – Insert a minimum number of gaps into one or both sequences to maximize matches DDLMLSPDDLAQWLTEDPGPSEAPRMSE |||:| | |: :: ||||| |:| DDLLL-PQDVEEFF---EGPSEALRVSG
  157. 157. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Sequence Alignments • Matches may be identical • Matches may include similar but not identical properties DDLMLSPDDLAQWLTEDPGPSEAPRMSE |||:| | |: :: ||||| |:| DDLLL-PQDVEEFF---EGPSEALRVSG DDLMLSPDDLAQWLTEDPGPSEAPRMSE |||:| | |: :: ||||| |:| DDLLL-PQDVEEFF---EGPSEALRVSG
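To make "identical matches" concrete, here is a small sketch that counts identities and gap columns in the gapped alignment from this slide. The function name `alignment_stats` is made up for illustration. Counting "similar" residues (the ':' marks) would additionally need a substitution matrix, which this sketch leaves out.

```python
def alignment_stats(row_a, row_b):
    """Count identical columns and gap columns in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    identical = 0
    gaps = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gaps += 1          # a column where one sequence has a gap
        elif x == y:
            identical += 1     # an exact match
    aligned = len(row_a) - gaps
    return {
        "identical": identical,
        "gaps": gaps,
        "aligned_columns": aligned,
        "percent_identity": 100.0 * identical / aligned if aligned else 0.0,
    }


stats = alignment_stats(
    "DDLMLSPDDLAQWLTEDPGPSEAPRMSE",
    "DDLLL-PQDVEEFF---EGPSEALRVSG",
)
print(stats)
```

Percent identity is reported here over ungapped columns only; other tools divide by the full alignment length or by the shorter sequence, so the denominator should always be stated.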
  158. 158. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Evolution of String Comparison • Hamming Distance (1950): The number of locations at which the two (binary) strings of equal length differ. • Levenshtein Distance (1965): The number of single character insertions, deletions, or substitutions (edits) required to transform one sequence into another. • “Substitution Matrices” (Dayhoff, 1978): Use of a Substitution Matrix to encode log likelihoods of substitutions. • “Gapped Alignments” (Many authors, 1980+): Mathematical models for allowing gaps in alignments • “Statistical Models” (Many authors, 1982+): No longer aligning against a specific string, but against the compiled statistics of sets of strings.
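The Levenshtein distance described on this slide can be computed with a short dynamic program. This is a minimal textbook sketch with unit costs for every edit; the function name and the test words are chosen only for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into b."""
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # deleting all i characters of a[:i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion from a
                current[j - 1] + 1,          # insertion into a
                previous[j - 1] + (x != y),  # substitution, or free match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # → 3
```

Keeping only two rows of the table is enough when just the distance is needed; recovering the edits themselves requires the full table.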
  159. 159. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Hamming Distance (1950’s) Count of the differences between two sequences of identical length 53/55 identical ctggagctcaccgcggtggcggccgctcta |||||||||||||||||||||||||||||| ctggagctcaccgcggtggcggccgctcta 49/55 identical gtaaagcccaccgcggtggcggccgctcta | ||| |||||||||||||||||||||| ctggagctcaccgcggtggcggccgctcta
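A Hamming distance is a one-line count. The sketch below applies it to the 30-base fragments visible on this slide; the slide's "identical" totals refer to longer strings than the excerpt shows, so only the visible part is compared here.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


reference = "ctggagctcaccgcggtggcggccgctcta"
variant = "gtaaagcccaccgcggtggcggccgctcta"

print(hamming(reference, reference))  # → 0
print(hamming(variant, reference))    # → 4
```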
  160. 160. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Substitution Matrixes • Margaret Dayhoff (1925-1983): • “Percent Accepted Mutation” (PAM) 1973 • Substitution frequencies from “real” alignments of known homologs, normalized to some percent mutation rate. • 1300 sequences, 72 families, closely related within families • $\mathrm{PAM}_{ij} = 10 \log_{10} R_{ij}$ • $R_{ij} = \mathrm{freq}(i \to j) / \mathrm{freq}(i)$ • $\mathrm{PAM}_n = (\mathrm{PAM}_1)^n$
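A toy sketch of the extrapolation step may help here. It assumes a made-up two-letter alphabet with a symmetric 1% mutation matrix and equal background frequencies (none of these numbers come from Dayhoff's data), raises the probability matrix to the n-th power, and only then converts to rounded log-odds scores. Dividing by the background frequency of the target letter is the usual log-odds convention and is an assumption about what the slide's shorthand means.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size))
             for j in range(size)] for i in range(size)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


def log_odds(m, freq):
    """Scores 10*log10(P(i -> j) / freq[j]), rounded to integers."""
    size = len(m)
    return [[round(10 * math.log10(m[i][j] / freq[j]))
             for j in range(size)] for i in range(size)]


# Hypothetical PAM1-like matrix: 1% chance that a letter mutates.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
freq = [0.5, 0.5]

pam50 = mat_pow(pam1, 50)
pam250 = mat_pow(pam1, 250)
print(log_odds(pam1, freq))
print(log_odds(pam50, freq))
```

The point of the toy is the order of operations: the matrix power is taken on mutation probabilities, not on the log-odds scores.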
  161. 161. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net PAM 250
          A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
      A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   0   0   0  -8
      R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2  -1   0  -1  -8
      N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   2   1   0  -8
      D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   3   3  -1  -8
      C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -4  -5  -3  -8
      Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   1   3  -1  -8
      E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   3   3  -1  -8
      G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   0   0  -1  -8
      H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   1   2  -1  -8
      I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -2  -2  -1  -8
      L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -3  -3  -1  -8
      K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   1   0  -1  -8
      M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -2  -2  -1  -8
      F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -4  -5  -2  -8
      P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1  -1   0  -1  -8
      S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   0   0   0  -8
      T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   0  -1   0  -8
      W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -5  -6  -4  -8
      Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -3  -4  -2  -8
      V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4  -2  -2  -1  -8
      B   0  -1   2   3  -4   1   3   0   1  -2  -3   1  -2  -4  -1   0   0  -5  -3  -2   3   2  -1  -8
      Z   0   0   1   3  -5   3   3   0   2  -2  -3   0  -2  -5   0   0  -1  -6  -4  -2   2   3  -1  -8
      X   0  -1   0  -1  -3  -1  -1  -1  -1  -1  -1  -1  -1  -2  -1   0   0  -4  -2  -1  -1  -1  -1  -8
      *  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8   1
  162. 162. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net BLOcks Substitution Matrix (BLOSUM) • Steven and Jorja Henikoff, 1992 • Calculated frequency of substitutions in conserved motifs, rather than across global alignments.
  163. 163. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Scoring gapped alignments • Fixed cost to open a gap • Weighted (affine) cost to increase an existing gap. • Models biological events better than a fixed cost • To score one alignment: – Sum substitution scores and gap costs. • To find the best possible alignment: – Calculate score for all possible alignments – Pick the best one.
  164. 164. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Global Alignments • May miss conserved domains/motifs
  165. 165. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Local Alignments • Good for finding short similar regions (eg protein domains, motifs)
  166. 166. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Optimal Alignments • Needleman & Wunsch and Smith-Waterman • Exhaustive Search • Alignment you get will have the best possible score • Others may have the same score, but none better • All pairs of sequences have an optimal alignment, whether or not they are meaningful • Slow
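For the global case, a minimal Needleman and Wunsch style score computation is sketched below. It uses a simple linear gap penalty and toy match/mismatch values instead of a substitution matrix, and returns only the optimal score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b under a linear gap penalty."""
    # previous[j]: best score aligning the processed prefix of a with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        current = [i * gap]  # a[:i] aligned against nothing but gaps
        for j, y in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if x == y else mismatch)
            current.append(max(
                diagonal,               # align x with y
                previous[j] + gap,      # x against a gap
                current[j - 1] + gap,   # y against a gap
            ))
        previous = current
    return previous[-1]


print(needleman_wunsch_score("ACGT", "ACGT"))  # → 4
print(needleman_wunsch_score("ACGT", "AGT"))   # → 2
```

Every cell of the table is filled, which is why the result is guaranteed optimal and also why the method is slow on database-sized inputs.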
  167. 167. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Smith, Waterman (1981) • Finds highest scoring region in common • Uses a “Dynamic Programming” algorithm • Compute time grows with the product of the two sequence lengths (quadratic for sequences of equal length) • Example: Is ELVIS in the SEVENELEVEN?
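The ELVIS question can be answered with a few lines of dynamic programming. This sketch uses toy scores (match +2, mismatch -1, linear gap -1) chosen for this example rather than a real substitution matrix, and it reports only the best local score.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between query and target."""
    rows, cols = len(query) + 1, len(target) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = query[i - 1] == target[j - 1]
            diagonal = h[i - 1][j - 1] + (match if same else mismatch)
            # The floor at zero lets a local alignment restart anywhere.
            h[i][j] = max(0, diagonal, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best


# Best local match is EL-V against ELEV: 2 + 2 - 1 + 2.
print(smith_waterman_score("ELVIS", "SEVENELEVEN"))  # → 5
```

The nested loops visit every query and target position pair once, which is where the quadratic running time comes from.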
  168. 168. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Pairwise Alignment Search • Needleman & Wunsch (1970): – Dynamic programming applied to global alignments • Smith & Waterman (1981): – Dynamic programming applied to “Local Dayhoff matrix alignments” • Pearson & Lipman (1988): FASTA • Altschul et al. (1990): BLAST – Heuristic approximations to Smith & Waterman allowing “reasonable” performance. • Altschul et al. (1997): Gapped BLAST – Further improvements to the BLAST algorithm
  169. 169. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Suboptimal Alignments • Take shortcuts for sake of speed • FASTA (Global or Local) – Pearson and Lipman (1988) • BLAST - Basic Local Alignment Search Tool – Altschul, Gish, Miller, Myers and Lipman (1990) – 10-100 times faster than regular Smith-Waterman – Less accurate – Today’s gold standard for searching large databases
  170. 170. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Why is alignment search complex? • Perfect String Matching: • Linear in the length of the strings • … with gaps: • Polynomial (~1.5 power) in the length of the strings • … seeking optimal sub-alignments: • Polynomial (~2.5 power) in the length of the strings • … across an exponentially increasing set of (potentially corrupt) data • A whole new set of problems.
  171. 171. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net The real problem? • The problem is not response time on any single step. • The problems are – Data management – Throughput and updating results – Biological Relevance • We don’t need a faster alignment algorithm, we need a better homology detector.
  172. 172. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Homology Search (ideal) • Query: • The thing about which you want information. • Target: • Any data at all, preferably all of it at once • Results: • Continually updated as new information is published, plus exhaustive cross references. • Clear distinction between lab verified and automatic annotations • “Clickable” is good.
  173. 173. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net BLAST • Basic Local Alignment Search Tool • Focus on local alignments • Important similarities are often confined to small regions within larger sequences. • BLAST is a heuristic algorithm: • Finds exact matches quickly (linear time) • BLAST is the single most popular homology search program (as of 2004)
  174. 174. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net BLAST Search BLAST Finds sequences that are “similar” to a query. Sequences producing significant alignments: (bits) e-Value gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC ... 993 0.0 gi|3688210|emb|AJ010861.1|BTAJ10861 Bos taurus MHC class I ... 961 0.0 gi|2864714|dbj|AB008598.1|AB008598 Bos taurus mRNA for MHC ... 882 0.0 gi|2864712|dbj|AB008597.1|AB008597 Bos taurus mRNA for MHC ... 827 0.0 gi|3688212|emb|AJ010862.1|BTAJ10862 Bos taurus MHC class I ... 803 0.0 gi|2864815|dbj|AB008649.1|AB008649 Bos taurus mRNA for MHC ... 783 0.0 … gi|4106072|gb|AF055348.1|AF055348 Diceros bicornis minor cl... 549 e-154 … gi|18699296|gb|AF464053.1| Sus scrofa MHC class I antigen (... 468 e-129 gi|188474|gb|M84694.1|HUMMHHLAB4 Human MHC class I HLA-B*40... 462 e-127 …
  175. 175. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net The BLAST Heuristic BLAST Heuristic: To be eligible for consideration, a sequence pair must contain an ungapped High-scoring Segment Pair (HSP) whose score exceeds some threshold. Two stage process: Find HSPs (linear time) Generate Alignments, anchored by those HSPs. caAACTGCTGaacgttgtcgtgagttctggctgcta-- --AACTGCTGggctctc-----ccgatcggctggcaaa This throws away the vast majority (99% in a random sample) of sequences in the target set.
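The two-stage idea can be sketched in miniature: index exact words of the target, look up the query's words to get seeds, then extend each seed without gaps until the score falls too far below its running maximum. This is a simplified illustration, not NCBI's implementation; the function names, the word size, the +1/-3 scores and the drop-off value are assumptions made for this example, and real protein searches seed on "neighborhood" words rather than exact ones.

```python
def find_seeds(query, target, w):
    """Return (query_pos, target_pos) pairs where a word of length w matches."""
    index = {}
    for t in range(len(target) - w + 1):
        index.setdefault(target[t:t + w], []).append(t)
    seeds = []
    for q in range(len(query) - w + 1):
        for t in index.get(query[q:q + w], []):
            seeds.append((q, t))
    return seeds


def extend_ungapped(query, target, q, t, w, match=1, mismatch=-3, xdrop=5):
    """Extend a seed in both directions; return (score, q_start, q_end)."""
    best = score = w * match
    q_start, q_end = q, q + w
    i, j = q + w, t + w
    while i < len(query) and j < len(target):      # extend to the right
        score += match if query[i] == target[j] else mismatch
        i += 1
        j += 1
        if score > best:
            best, q_end = score, i
        elif best - score > xdrop:
            break
    score = best
    i, j = q - 1, t - 1
    while i >= 0 and j >= 0:                       # extend to the left
        score += match if query[i] == target[j] else mismatch
        if score > best:
            best, q_start = score, i
        elif best - score > xdrop:
            break
        i -= 1
        j -= 1
    return best, q_start, q_end


query = "TTAACTGCTGCC"
target = "GGAACTGCTGAA"
seeds = find_seeds(query, target, w=8)
hsp = extend_ungapped(query, target, *seeds[0], w=8)
print(seeds, hsp)
```

Target sequences that yield no seed at all are never aligned, which is how the heuristic discards most of the database cheaply.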
  176. 176. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net BLAST Search Spaces
      Program   Query Type    Database Type   Frames (query x database)
      blastp    Protein       Protein         1x1
      blastn    Nucleotide    Nucleotide      1x1
      blastx    Nucleotide*   Protein         6x1
      tblastn   Protein       Nucleotide*     1x6
      tblastx   Nucleotide*   Nucleotide*     6x6
      * Translated in all 3 reading frames on both strands
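The "6" in the table comes from three reading frames on each strand. A minimal sketch of six-frame translation follows, assuming the standard genetic code and plain A/C/G/T input; the frame labels "+1" to "-3" and the function names are just this example's naming, and any codon containing another character is rendered as X.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons give X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate all three frames of both strands."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


# First 18 bases of the AB008577 mRNA used throughout these slides.
frames = six_frames("AGGTATTTCCACACCGCC")
print(frames["+1"])  # → RYFHTA
```

Frame +1 reproduces the start of the /translation qualifier in the GenBank entry, which is a handy sanity check on the codon table.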
  177. 177. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net BLAST Scores • “Score” $S = \sum(\text{substitution scores}) - \sum(\text{gap penalties})$ • “Bit Score” • Score, normalized for $\lambda$ and $K$, two parameters which should be left alone anyway, and converted to something looking vaguely information theoretic. $S' = (\lambda S - \ln K) / \ln 2$ • “E-Value” • “Expected number of hits of this score, in a target set of size $n$, with a query of length $m$” $E = m n \, 2^{-S'}$
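A short numeric sketch of these two quantities, using the standard Karlin-Altschul forms: bit score = (lambda * S - ln K) / ln 2 and E = m * n * 2^(-bit score). The values of lambda, K, the raw score and the search-space sizes below are placeholders picked for illustration, not parameters read from any BLAST run.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw alignment score into bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_length, database_length):
    """Expected number of chance hits scoring at least this many bits."""
    return query_length * database_length * 2.0 ** (-bits)


# Hypothetical statistical parameters and search-space sizes.
lam, k = 0.267, 0.041
m, n = 500, 2 * 10**10

for raw in (40, 80, 160):
    bits = bit_score(raw, lam, k)
    print(raw, round(bits, 1), e_value(bits, m, n))
```

Because n appears directly in the formula, the same alignment gets a larger E-value every time the database grows, even though its bit score stays fixed.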
  178. 178. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net BLAST Search • Bits: Large scores are good • E-value: Small scores are good Sequences producing significant alignments: (bits) e-Value gi|4165369|dbj|AB008577.1|AB008577 Bos taurus mRNA for MHC ... 993 0.0 gi|3688210|emb|AJ010861.1|BTAJ10861 Bos taurus MHC class I ... 961 0.0 gi|2864714|dbj|AB008598.1|AB008598 Bos taurus mRNA for MHC ... 882 0.0 gi|2864712|dbj|AB008597.1|AB008597 Bos taurus mRNA for MHC ... 827 0.0 gi|3688212|emb|AJ010862.1|BTAJ10862 Bos taurus MHC class I ... 803 0.0 gi|2864815|dbj|AB008649.1|AB008649 Bos taurus mRNA for MHC ... 783 0.0 … gi|4106072|gb|AF055348.1|AF055348 Diceros bicornis minor cl... 549 e-154 … gi|18699296|gb|AF464053.1| Sus scrofa MHC class I antigen (... 468 e-129 gi|188474|gb|M84694.1|HUMMHHLAB4 Human MHC class I HLA-B*40... 462 e-127 …
  179. 179. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net E-Value 2.71828182845904523536028747 • Unstable: – Changes every time the dataset grows. • E-Values are not probabilities – Yet people seem to treat them as though they are • Rules of thumb: – $10^{-30}$: A good, solid hit. Take it to the lab and verify it. – $10^{-10}$: Okay. Base some further literature search on this. – 1: Threshold of random chance – 10: BLAST default cutoff
  180. 180. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net “Low Complexity Regions” • By default, BLAST filters out regions of “Low Complexity” and replaces them with “XXXXX” in the alignments. • This may or may not be what you want.
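To give a feel for what "low complexity" means, here is a deliberately simple stand-in: slide a window along the sequence and mask any window whose Shannon entropy falls below a threshold. BLAST itself relies on dedicated filters (SEG for proteins, DUST for nucleotides), so the window size, threshold and mask character here are illustrative assumptions only.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the character distribution in a string."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=1.0, mask_char="N"):
    """Replace every position covered by a low-entropy window."""
    masked = list(seq)
    for start in range(0, max(1, len(seq) - window + 1)):
        chunk = seq[start:start + window]
        if chunk and shannon_entropy(chunk) < threshold:
            for i in range(start, start + len(chunk)):
                masked[i] = mask_char
    return "".join(masked)


print(mask_low_complexity("A" * 20))
print(mask_low_complexity("ACGT" * 5))
```

A homopolymer run has zero entropy and is masked completely, while a sequence using all four bases evenly scores the maximum of 2 bits and is left alone.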
  181. 181. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Potential Problems • Round off errors • Can fail the ‘diff’ test between 32 and 64 bit architectures • “Silent” errors • Check those logfiles. • Parsing • Please do not write another BLAST output parser. • There are too many of them already in the world. • Seriously. • I’m not kidding about this one. • Shadowing: • Omission of interesting short hits in favor of less interesting but longer hits.
  182. 182. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net BLAST Implementations • NCBI BLAST • NCBI Web Site • NCBI command line tools • Washington University BLAST • (web based & command line) • TIGR online searches • Los Alamos National Lab – MPI-BLAST • TimeLogic Corporation • “Tera-BLAST” • Everyone else in the world…
  183. 183. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Is There A Parallel BLAST? Yes.
  184. 184. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Multiple Sequence Alignment • Given a family of related sequences, construct an optimal multiple sequence alignment (MSA). • Based on that MSA, construct models which can be used to recognize as yet unrecognized members of the set.
  185. 185. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Multiple Sequence Alignments • Patterns • Motifs • Position Specific Scoring Matrixes • Hidden Markov Models • Neural Networks
  186. 186. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Danger Points • No longer computing similarity to any single observed sequence (what would they test in the lab?) • “Transitive Catastrophe” • Statistical Starvation.
  187. 187. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Beware Intellectual Inbreeding • Using known protein families, we compute costs for amino acid substitutions. • Using those costs, we search for potential homologies and new (putative) families. • Build statistical models based on putative protein families • Rediscover known families with statistical techniques • Does this provide independent confirmation?
  188. 188. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Example: ClustalW • Align each sequence to each other sequence • Select a seed alignment • Build up a multiple alignment from the pieces • Works great for close relatives
  189. 189. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Conserved Patterns • Motifs: – Conserved substrings in multiple alignments / sets of sequences • Position Specific Scoring Matrixes. – Add “at each position in an alignment” to the work of Dayhoff.
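A position specific scoring matrix can be sketched as per-column log-odds built from an ungapped alignment. The motif instances below are made-up examples, and the uniform background and pseudocount of 1 are simplifying assumptions rather than what any particular tool uses.

```python
import math
from collections import Counter


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0):
    """One dict of log2-odds scores per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)  # uniform background (assumption)
    total = len(aligned) + pseudocount * len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            symbol: math.log2(((counts[symbol] + pseudocount) / total)
                              / background)
            for symbol in alphabet
        })
    return pssm


def score_word(pssm, word):
    """Sum the column scores for a word as long as the matrix."""
    return sum(column[symbol] for column, symbol in zip(pssm, word))


# Hypothetical instances of a short DNA motif.
motif_instances = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pssm = build_pssm(motif_instances)

print(score_word(pssm, "TATAAT"), score_word(pssm, "GCGCGC"))
```

Unlike a single Dayhoff-style matrix, the score for a given letter now depends on where in the motif it occurs, which is what "at each position in an alignment" adds.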
  190. 190. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net What is HMMer? • Written by Sean Eddy at Wash U • Open Source • 15 separate executables • Build a statistical model of a multiple sequence alignment • Search sequence databases with models • Search model databases with sequences
  191. 191. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Search Shrimp for globin • Build an HMM from 50 globins % hmmbuild globin.hmm globins50.msf • Calibrate the model % hmmcalibrate globin.hmm • Search shrimp sequence database with model % hmmsearch globin.hmm Artemia.fa • Search model database with shrimp sequences % hmmpfam globin.hmm Artemia.fa
  192. 192. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net MSF Format…
      DNA_MULTIPLE_ALIGNMENT 1.0
      Three anthropoidea
      MSF: 50 Type: N Check: 2666 ..
      Name: Homo_sapiens    Len: 50 Check: 8318 Weight: 1.00
      Name: Pan_paniscus    Len: 50 Check: 7854 Weight: 1.00
      Name: Gorilla_gorilla Len: 50 Check: 7778 Weight: 1.00
      //
      Homo_sapiens    AGUCGAGUC...GCAGAAAC
      Pan_paniscus    AGUCGCGUCG..GCAGAAAC
      Gorilla_gorilla AGUCGCGUCG..GCAGAUAC
      Homo_sapiens    GCAUGAC.GACCACAUUUU.
      Pan_paniscus    GCAUGACGGACCACAUCAU.
      Gorilla_gorilla GCAUCACGGAC.ACAUCAUC
      Homo_sapiens    CCUUGCAAAG
      Pan_paniscus    CCUUGCAAAG
      Gorilla_gorilla CCUCGCAGAG
  193. 193. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net hmm State Diagram
  194. 194. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net hmm Format HMMER2.0 [2.2g] NAME globins50 LENG 148 ALPH Amino RF no CS no MAP yes COM ../binaries/hmmbuild globin.hmm globins50.msf COM ../binaries/hmmcalibrate globin.hmm NSEQ 50 DATE Thu Jul 25 10:51:38 2002 CKSUM 9858 XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4 NULT -4 -8455 NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644 EVD -41.853970 0.212647 HMM A C D E F G H I K L M N P Q R S T V W Y m->m m->i m->d i->m i->i d->m d->d b->m m->e -661 * -1444 1 77 -228 -1302 -1020 -730 -1034 -756 578 -803 -375 82 -791 - 1461 -720 -959 364 -94 2204 -1315 -857 9 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -39 -5807 -6849 -894 -1115 -701 -1378 -661 *
  195. 195. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net What else could be bioinformatics? • Fold / Structure / Docking / Function predictions on proteins and bioactive molecules • Ontology building / literature searches / text mining / knowledge management • Image processing to support lab automation / data capture / experiment steering • Medical records integration with proteomic / transcript studies • Expert systems / AI / Clinical / Lab assistant • Virtual organizations, distributed databases, ad hoc expert conversations…
  196. 196. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net One would expect wet-lab scientists to have a healthy skepticism of any results, knowing how often experiments fail, and how much bad data has made it out into the literature, but many seem to have an almost mystical faith in anything produced by computation. On the other hand, computational people seem to have an almost mystical faith in wet-lab verification---expecting experiments to be neat, quick deterministic tests like "if" statements in code. - Gordon D. Pusch
  197. 197. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net What can I do today? • CS: – Take biology coursework – Accept that biology is really, really complex and difficult. • Bio: – Take CS coursework – Accept that computer engineering / software development is tricky. • Administrators: – Decide to build a “spire, which will be visible from afar” • All: – Attend Journal Clubs, symposia, etc. – Get a bigger monitor
  198. 198. © 2004: The BioTeam http://bioteam.net cdwan@bioteam.net Thank you
