02.cb1.ppt

  1. 3/24/2023 ©Bud Mishra, 2001 L2-1 Computational Biology Lecture #2: Genome Organization. Bud Mishra, Professor of Computer Science and Mathematics. 9/24/2001
  2. Active Areas of Research (1) • Human Genome Project: (Completed?) – Read 3 billion base pairs in 46 human chromosomes – Deemed “substantially completed on June 27, 2000.” • Polymorphisms and Haplotyping – SNPs (Single Nucleotide Polymorphisms): Catalog the single-base-pair variations, occurring about once in every 800 base pairs of the human genome, across entire populations – RFLP-Map: Restriction Fragment Length Polymorphisms
  3. Active Areas of Research (2) • Transcription Maps: – Identify all the genes (about 30,000?) in the human genome. – Particularly interesting are the ones involved in cancer: about 100 oncogenes and 1000 tumor suppressor genes • Linkage Analysis: – Relate genes (or polymorphic markers) to phenotypes (externally observable traits) by analyzing genomes of a family (kinship) or over a population.
  4. Active Areas of Research (3) • Functional Genomics: – Understand how an interactive network of genes affects a chain of metabolic pathways to ultimately determine the phenotypes • Comparative Genomics: – Relate genes within and across species to understand their evolutionary relationships (phylogeny).
  5. Active Areas of Research (4) • Cell Informatics: – Interaction between proteins (membrane and soluble ones) to determine the dynamics of a cell. – Interaction among a heterogeneous population of cells. • Rational Drug Design: – Design of drugs and delivery systems to modify the dynamics of the cells.
  6. Introduction to Biology • Genome: – Hereditary information of an organism is encoded in its DNA and enclosed in a cell (unless it is a virus). All the information contained in the DNA of a single organism is its genome. • A DNA molecule can be thought of as a very long sequence of nucleotides or bases over the alphabet Σ = {A, T, C, G}
  7. Complementarity • DNA is a double-stranded polymer and should be thought of as a pair of sequences over the alphabet Σ = {A, T, C, G}. However, there is a relation of complementarity between the two sequences: – A ↔ T, C ↔ G – That is, if there is an A (respectively, T, C, G) on one sequence at a particular position, then the other sequence must have a T (respectively, A, G, C) at the same position. • We will measure the sequence length (or the DNA length) in terms of base pairs (bp): for instance, human (H. sapiens) DNA is 3.3 × 10^9 bp, measuring about 6 ft of DNA polymer when completely stretched out!
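The complementarity rule above can be sketched in a few lines of Python. This is only an illustration: the function names `complement` and `reverse_complement` are assumptions introduced here, not something from the slides.

```python
# Watson-Crick pairing as stated on the slide: A <-> T, C <-> G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement, position for position."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("GAATTC"))          # CTTAAG
    print(reverse_complement("GTCCAG"))  # CTGGAC
```

Because the two strands run antiparallel, the reverse complement is usually the more useful form when comparing a strand with its partner.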
  8. The Central Dogma • The intermediate molecule carrying the information out of the nucleus of a eukaryotic cell is RNA, a single-stranded polymer. • RNA also controls the translation process, in which amino acids are assembled into proteins. • The central dogma (due to Francis Crick in 1958) states that these information flows are all unidirectional: “The central dogma states that once ‘information’ has passed into protein it cannot get out again. The transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein, may be possible, but transfer from protein to protein, or from protein to nucleic acid is impossible. Information means here the precise determination of sequence, either of bases in the nucleic acid or of amino acid residues in the protein.”
  9. Interrupted Genes: • An open reading frame (containing a gene) consists of – INTRONS: Intervening sequences → noncoding regions – EXONS: Protein-coding regions • Introns are abundant in eukaryotes and certain animal viruses.
  10. Interrupted Genes: [Figure: DNA containing Exon1, Exon2 and Intron1, Intron2, Intron3 is transcribed into RNA (the primary transcript), which is spliced into mRNA.]
  11. Interrupted Genes: • Introns can occur between individual codons or within a single codon. [Figure labels: Cell; Nucleus; hnRNA (heterogeneous nuclear RNA): mixture of primary transcripts with varying numbers of introns spliced; mRNA.]
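The transcription-then-splicing picture on the preceding slides can be illustrated with a minimal sketch. The toy transcript, the intron coordinates, and the function name `splice` are all hypothetical; real splicing is directed by sequence signals in the transcript, not by indices supplied in advance.

```python
def splice(primary_transcript: str, introns: list[tuple[int, int]]) -> str:
    """Remove introns given as half-open (start, end) index pairs.

    The remaining pieces (the exons) are joined in their original order.
    """
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(primary_transcript[position:start])  # exon before this intron
        position = end                                     # skip over the intron
    exons.append(primary_transcript[position:])            # trailing exon
    return "".join(exons)


if __name__ == "__main__":
    # Toy example: uppercase = exon, lowercase = intron.
    # The intron falls inside the codon GCU (after its first base).
    transcript = "AUGG" + "guaaguag" + "CUUAA"
    mrna = splice(transcript, [(4, 12)])
    print(mrna)  # AUGGCUUAA, read as the codons AUG GCU UAA
```

The toy intron sits inside a codon, matching the slide's remark that introns can occur within a single codon as well as between codons.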
  12. Some Genes…
      Gene Product          Organism     Exon Length (bp)   # Introns   Intron Length (bp)
      Adenosine deaminase   Human        1500               11          30,000
      Apolipoprotein B      Human        14,000             28          29,000
      Erythropoietin        Human        582                4           1562
      Thyroglobulin         Human        8500               ≥ 40        100,000
      α-interferon          Human        600                0           0
      Fibroin               Silk worm    18,000             1           970
      Phaseolin             French bean  1263               5           515
  13. Regulation of Gene Expression • Motifs (short DNA sequences) that regulate transcription – Promoter – Terminator • Motifs that modulate transcription – Repressor – Activator – Antiterminator [Figure labels: Promoter; Gene; Terminator; Transcriptional Initiation; Transcriptional Termination; 10-35 bp]
  14. Promoters • pol I (RNA polymerase I) – Transcribes ribosomal RNA genes – Roughly 100 to 1000 bp in front of the gene • pol II (RNA polymerase II) – Transcribes genes encoding polypeptides – Complex and variable regulatory regions • pol III (RNA polymerase III) – Transcribes transfer RNA and other small RNAs – Both upstream and downstream
  15. Motifs • Each motif is a binding site for a specific protein • Transcription Factor: – Transcription factors (specific to a cell type or to environmental conditions) bind to regulatory regions and facilitate • Assembly of RNA polymerase into a transcriptional complex • Activation of a transcriptional complex. • Termination Factor: – Assembly of proteins for termination and modification of the end of the RNA • Epigenetic Changes – Methylation of the cytosine in the 5’ region – Structural changes in chromatin
  16. Organization of Genetic Information • Bacterial Genome: – Genes are closely spaced along the DNA. – The sequences of genes may overlap. – Related genes (encoding enzymes whose functions are part of the same pathway or whose activities are related) are linked as a single transcription unit.
  17. Organization of Genetic Information • Eukaryotic Genome: – Genes are separated by long stretches of noncoding DNA sequences. – Multiple genes in a single transcription unit are extremely rare. – Multiple chromosomes: linear – Chloroplast and mitochondrial genomes: circular – Genes appearing on the same chromosome are syntenic.
  18. Location of Some Genes on Human Chromosomes
      Genes                          Chromosome
      α-globin cluster               16
      β-globin cluster               11
      Immunoglobulin
        κ (light chain)              2
        λ (light chain)              22
        Heavy chain                  14
        Pseudogenes                  9, 32, 15, 18
      Growth hormone                 17
      Thymidine kinase               17
      Insulin                        11
      Galactokinase                  11
      Viral oncogene
        c-sis                        22
        c-mos                        8
        c-Ha-ras-1                   11
        c-myb                        6
      Interferons
        α & β cluster                9
        γ                            12
  19. Eukaryotic Genome • Multiple copies of the same gene – Solve the “supply problem” – There are several hundred ribosomal RNA genes in mammals • Pseudogenes – Nonfunctional copies of genes (deletions or alterations in the DNA sequence) – The number of pseudogenes for a particular gene varies greatly and differs from one organism to another.
  20. Genes in Eukaryotes • A gene may appear exactly once • It may be part of a family of repeated sequences. Members of a family may be clustered or dispersed. • Members of a gene family may be related and functional (expressed at different times in development, or in different cells) or may be pseudogenes. • Chromosomal Morphology: – Nucleolar organizers (genes for ribosomal RNA) – Telomeric and centromeric regions (tandemly repeated sequences)
  21. The Rearrangement of DNA Sequences • Reshuffling of genes between homologous chromosomes via reciprocal crossing-over during both meiosis and mitosis. • Gene synteny and linkages are usually preserved. • Most rearrangements are random. • Some rearrangements are normal processes altering gene expression in an orderly and programmed manner.
  22. Chromosomal Aberrations • Breakage • Translocation (among non-homologous chromosomes) • Formation of acentric and dicentric chromosomes • Gene conversions • Amplifications and deletions • Point mutations • Jumping genes → transposition of DNA segments • Programmed rearrangements → e.g., antibody responses
  23. Repeat Structure • Copy Number: 2 to about 10^6 • Direct Repeats “head-to-tail” – Tandem repeats or separated by other sequences • Inverted Repeats “head-to-head” – Stem-and-loop structure – Hairpin structure • Reverse Palindrome • True Palindrome
  24. Repeat Structure
      • Tandem Direct Repeats: 5’-AAGAG AAGAG AAGAG-3’
      • Inverted Repeats: 5’-GTCCAG N…N CTGGAC-3’ paired with 3’-CAGGTC N…N GACCTG-5’
        [Figure: stem-and-loop structure associated with inverted repeats; stem base pairs G-C, A-T, C-G, C-G, T-A, G-C]
      • Reverse Palindrome: 5’-GAATTC-3’ paired with 3’-CTTAAG-5’
      • True Palindrome: 5’-GTCAATGA AGTAACTG-3’
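The four repeat types on this slide can be told apart with simple string tests. The sketch below is only an illustration using the slide's own example sequences; the function names are assumptions introduced here.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_tandem_repeat(seq: str, unit: str) -> bool:
    """True if seq is two or more head-to-tail copies of unit."""
    if not unit or len(seq) < 2 * len(unit):
        return False
    return seq == unit * (len(seq) // len(unit))


def is_inverted_repeat(left_arm: str, right_arm: str) -> bool:
    """True if the right arm is the reverse complement of the left arm."""
    return right_arm == reverse_complement(left_arm)


def is_reverse_palindrome(seq: str) -> bool:
    """True if seq equals its own reverse complement (e.g. GAATTC)."""
    return seq == reverse_complement(seq)


def is_true_palindrome(seq: str) -> bool:
    """True if seq reads the same in both directions on one strand."""
    return seq == seq[::-1]


if __name__ == "__main__":
    print(is_tandem_repeat("AAGAGAAGAGAAGAG", "AAGAG"))  # True
    print(is_inverted_repeat("GTCCAG", "CTGGAC"))        # True
    print(is_reverse_palindrome("GAATTC"))               # True
    print(is_true_palindrome("GTCAATGAAGTAACTG"))        # True
```

An inverted repeat whose two arms are adjacent, with no spacer, is the same thing as a reverse palindrome, which is why both tests rely on `reverse_complement`.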
  25. Repeats within the Genome • Gene Family – A gene and its cognate pseudogenes • Satellite: Repeats made of noncoding units – Minisatellites: Tandem repeats, mostly in centromeric regions – Satellite repeat units vary in length from 2 base pairs to several thousand.
  26. Interspersed Repeats • SINEs: Short Interspersed Repeats – Each repeat unit is 100 to 500 bp long – Processed pseudogenes derived from class III genes – Example: Alu repeats, dimeric head-to-tail repeats of 130 bp • LINEs: Long Interspersed Repeats – Each unit is longer than 6 kb.
  27. A Genome Grammar • Consists of – A stochastic grammar specifying the target DNA sequence, together with – A description of polymorphisms and – A description of the sampling strategy for experiments • ⟨specification⟩ → ⟨DNA-Seg⟩ ⟨Poly-Seg⟩* ⟨Sample-Seg⟩+
  28. Stochastic Grammar • ⟨DNA-Seg⟩ → “.dna” ⟨DNA-Spec⟩ • ⟨Poly-Seg⟩ → “.poly” ⟨Weight⟩+ ⟨Poly-Spec⟩ • ⟨Sample-Seg⟩ → “.sample” ⟨Sample-Spec⟩
  29. DNA Sequence
      .dna
      A = 150                  ← sequence of length 150; Pr(A) = Pr(T) = Pr(C) = Pr(G) = ¼
      B = A A m(.30)           ← A followed by a mutated copy of A; Pr(Mutation) = .30
      C ~ 3-7 p(.2, .3, .3)    ← a string of length 3 to 7, Pr(A) = .2, Pr(T) = .3, Pr(C) = .3, Pr(G) = .2; C = constant string
      D = C m(0.03) n(10,30)   ← m = mutation rate, n = copy number
      S = 30,000,000 B m(.05, .10) p(.1,.1,.01) n(10) D !(500)
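To make the first three `.dna` rules concrete, here is a minimal Python sketch of how such sequences could be drawn. It is not the lecture's implementation: the function names are hypothetical, and it assumes that `m(rate)` means an independent point substitution at each base and that `p(a, t, c)` gives Pr(A), Pr(T), Pr(C), with Pr(G) taking the remainder.

```python
import random

BASES = "ATCG"


def random_sequence(length, rng, p_a=0.25, p_t=0.25, p_c=0.25):
    """Draw `length` independent bases; Pr(G) is whatever probability remains."""
    weights = [p_a, p_t, p_c, 1.0 - p_a - p_t - p_c]
    return "".join(rng.choices(BASES, weights=weights, k=length))


def mutate(seq, rate, rng):
    """Assumed meaning of m(rate): substitute each base independently with
    probability `rate`, always choosing a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)                      # fixed seed for repeatability
    a = random_sequence(150, rng)               # A = 150
    b = a + mutate(a, 0.30, rng)                # B = A A m(.30)
    c = random_sequence(rng.randint(3, 7), rng, 0.2, 0.3, 0.3)  # C ~ 3-7 p(.2,.3,.3)
    print(len(a), len(b), len(c))
```

The rules for D and S (copy numbers, two-argument mutation rates, and the `!(500)` term) are left out because the slide does not spell out their semantics.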
  30. Polymorphisms • Modify the ancestral sequence by a series of – S: Point mutations (SNPs) – D: Deletions – X: Translocations
      .poly .8 .8
      S 0.00012T
      D 1-1 .00012
      D 2-2 .00006
      D 3-3 .00002
      D 500-1000 .00005
      X 1000-2000 .0005
      .poly .4
      S .001
      D 1-2 .0005
      Two haplotypes of weight .8 each and one haplotype of weight .4
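One possible reading of the `.poly` entries is sketched below in Python for the S and D operations. The interpretation is an assumption, not something the slide states: each `S rate` line is treated as a per-base substitution probability, and each `D lo-hi rate` line as a per-position probability of deleting a segment whose length is uniform between `lo` and `hi`. The function names are hypothetical, and translocations (X) are not modeled.

```python
import random

BASES = "ATCG"


def apply_snps(seq, rate, rng):
    """Assumed reading of 'S rate': each base is substituted with probability rate."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def apply_deletions(seq, min_len, max_len, rate, rng):
    """Assumed reading of 'D lo-hi rate': at each position, with probability
    rate, drop a segment of length uniform in [min_len, max_len]."""
    out = []
    i = 0
    while i < len(seq):
        if rng.random() < rate:
            i += rng.randint(min_len, max_len)  # skip the deleted segment
        else:
            out.append(seq[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    ancestral = "".join(rng.choices(BASES, k=10_000))
    # Second .poly block on the slide: S .001, then D 1-2 .0005
    haplotype = apply_deletions(apply_snps(ancestral, 0.001, rng), 1, 2, 0.0005, rng)
    print(len(ancestral), len(haplotype))
```

Substitutions keep the sequence length fixed, while deletions can only shorten it, so a derived haplotype is never longer than the ancestral sequence under this reading.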
  31. Sampling
      .sample 48,000           ← Number of samples
      400 600 .5               ← Read lengths
      .01 .02                  ← Sequence read errors
      .33 .33                  ← Failure of read
      .3 1800 2200 .005        ← Clone size
      .sample 12,000
      400 600 .5
      .01 .03
      .33 .33
      .4 9000 11000 .015
  32. Experiment • The first sample generates 48,000 end reads from inserts of average length 2 Kbp. – Sample proportions: 40% from haplotype H1, 40% from H2 and 20% from H3 • The second sample generates 12,000 end reads from inserts of average length 10 Kbp. – Sample proportions: 40% from haplotype H1, 40% from H2 and 20% from H3
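The 40% / 40% / 20% proportions follow from normalizing the haplotype weights .8, .8 and .4 given in the polymorphism specification. A short Python sketch of that arithmetic, with hypothetical function names, also gives the expected number of reads per haplotype in each sample:

```python
def haplotype_proportions(weights):
    """Normalize haplotype weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def expected_reads(num_reads, weights):
    """Expected number of reads drawn from each haplotype."""
    return [round(num_reads * p) for p in haplotype_proportions(weights)]


if __name__ == "__main__":
    weights = [0.8, 0.8, 0.4]                # H1, H2, H3
    print(haplotype_proportions(weights))    # [0.4, 0.4, 0.2]
    print(expected_reads(48_000, weights))   # [19200, 19200, 9600]
    print(expected_reads(12_000, weights))   # [4800, 4800, 2400]
```

These are expectations only; in an actual random sample the per-haplotype counts would fluctuate around these values.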