2. The Information Content of the Proteome. Knowledge / Information / Data. 1) cdc2+, cyclinB+: mitosis. 2) cdc2-: arrest. 3) cdc2 binds importin alpha/beta. …
3. Evolution of a Relational Proteome. [Timeline figure, axis 1965 / 1975 / 1985 / 1995 / 2005. Labels: NCBI, PDB, SCOP, PDGF-VSIS, …, HGP, Insulin, Atlas, Smith-Waterman, NEWAT, Needleman-Wunsch, REFSEQ, SWISSPROT, Protein Domains]
4. Data vs. Knowledge. Data > Information. Sequences; Structures/Functions. Sources: http://bytesizebio.net/ ; http://www.dna.affrc.go.jp/growth/images/P-grwth-entrs.gif ; PLoS Comput Biol. 2006 Aug 25;2(8):e114. Epub 2006 Jul 14. ; Genome Res. 2008 March; 18(3): 449–461. doi: 10.1101/gr.6943508. ; http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=fold-scop
5. An Integrated Framework for Building Molecular Biological Data Marts: putting the model to use …
6. Data Marts: Targeted Integration. Flat data repositories: function, structure, sequence, taxonomy
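To make the data mart idea concrete, here is a minimal sketch in Python using the standard `sqlite3` module. It assumes a hypothetical four-table layout (taxonomy, sequence, structure, function) linked by shared keys; the table names, column names, the `demo_FYN` identifier, and the placeholder sequence and annotation values are all invented for illustration and are not the schema used by the tools in this talk.

```python
import sqlite3

# Hypothetical schema: one table per flat repository, joined by shared keys.
SCHEMA = """
CREATE TABLE taxon             (taxon_id INTEGER PRIMARY KEY, species TEXT);
CREATE TABLE protein_sequence  (protein_id TEXT PRIMARY KEY, residues TEXT,
                                taxon_id INTEGER REFERENCES taxon(taxon_id));
CREATE TABLE protein_structure (pdb_id TEXT PRIMARY KEY,
                                protein_id TEXT REFERENCES protein_sequence(protein_id));
CREATE TABLE protein_function  (protein_id TEXT REFERENCES protein_sequence(protein_id),
                                annotation TEXT);
"""


def build_mart():
    """Create an in-memory mart and load one illustrative record."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO taxon VALUES (9606, 'Homo sapiens')")
    # 'PLACEHOLDER' stands in for a real sequence; it is not real data.
    conn.execute("INSERT INTO protein_sequence VALUES ('demo_FYN', 'PLACEHOLDER', 9606)")
    conn.execute("INSERT INTO protein_structure VALUES ('1AZG', 'demo_FYN')")
    conn.execute("INSERT INTO protein_function VALUES ('demo_FYN', 'illustrative annotation')")
    return conn


def lookup(conn, pdb_id):
    """Join structure -> sequence -> taxonomy -> function for one PDB entry."""
    return conn.execute(
        """
        SELECT st.pdb_id, sq.protein_id, t.species, f.annotation
        FROM protein_structure st
        JOIN protein_sequence sq ON sq.protein_id = st.protein_id
        JOIN taxon t             ON t.taxon_id = sq.taxon_id
        JOIN protein_function f  ON f.protein_id = sq.protein_id
        WHERE st.pdb_id = ?
        """,
        (pdb_id,),
    ).fetchone()


if __name__ == "__main__":
    print(lookup(build_mart(), "1AZG"))
```

The point of the sketch is only that a single query can cross all four data types once they share keys, which a set of unlinked flat files cannot offer.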
7. A Family of Data Driven Molecular Biology Tools
 - Integrated structure calculation via NMR (hybrid methods, iterative processing, reproducibility). spectra, sequence, chemical shifts -> structure
 - Automated detection of signaling/binding motifs in a candidate protein. protein sequence -> biological activity
 - Filtration of “passenger” residues from specificity/functional residues on surfaces of protein structures. sequence + structure -> function
 - “Multidimensional” sequence comparison. sequence + taxonomy -> evolution
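As a rough illustration of the "protein sequence -> biological activity" step, the sketch below scans a sequence for short motifs with regular expressions. The two patterns, P..P and [KR]..[KR], are the ones this deck later cites as known to interact with SH3 domains; the function name `find_motifs` and the use of a lookahead to catch overlapping hits are assumptions made for this example, not a description of the actual tool.

```python
import re

# Motif patterns quoted in this deck as known SH3 interactors.
MOTIFS = ("P..P", "[KR]..[KR]")


def find_motifs(sequence, motifs=MOTIFS):
    """Return (pattern, start index, matched text) for every hit,
    including overlapping ones, ordered by position."""
    seq = sequence.upper()
    hits = []
    for pattern in motifs:
        # A zero-width lookahead lets overlapping occurrences all be reported.
        for match in re.finditer(f"(?=({pattern}))", seq):
            hits.append((pattern, match.start(), match.group(1)))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


if __name__ == "__main__":
    # Peptides taken from the SH3 slides in this deck.
    for peptide in ("PRPLPVAP", "TPQVPL", "APPLPR", "SRSTK"):
        print(peptide, find_motifs(peptide))
```

Patterns this short match very often by chance, which is why such a scan is only a first filter.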
22. Further (GO) integration controls for the degenerate nature of motif searches. [Figure counts: ~400, ~400, ~900] PLOS One, 2010
23. Short sequences are degenerate… Can they be merged with structural and evolutionary information? Chemistry & Biology, January 2000; BMC Genomics, 2009
24. Venn: An Integrated Application for Database Driven Homology Threading of Protein Structures …. Nucleic Acids Research, 2009; Trends in Plant Science, 2010
26. VENN-InterfaceMiner: How do different SH3 binding peptides functionally relate to one another?
Left to right …
 1AZG (Human FYN)    PRPLPVAP   LYYGDWIPSNY
 1AVZ (Human FYN)    TPQVPL     YD … GDWPSNY
 1PRL (Chicken FYN)  APPLPR     YD ... WPNY (not shown)
 1H3H (Mouse GRB2)   SRSTK      ENPSWWTLPANY
29. What happens when a sequence is inherently noisy? Search parameters: max 100-250; eval 10E-3; ...; word size 3-5; score matrix 80, 62, 30; gap? 0, 4; Q/N? Sequences: manskysktdvqqvkrqnqqsasgqgqygtef ; gsetdaqqvrkqnqsaeqnkqqns
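One simple way to put a number on how "noisy" such a sequence is, assuming that "Q/N?" on this slide refers to glutamine/asparagine-rich composition, is to compute the fraction of residues that are Q or N. The helper below is a hypothetical sketch for that check; it is not a parameter or feature of any search program.

```python
from collections import Counter


def residue_fraction(sequence, residues="QN"):
    """Fraction of positions in `sequence` occupied by any of `residues`.

    Case-insensitive; returns 0.0 for an empty sequence.
    """
    seq = sequence.upper()
    if not seq:
        return 0.0
    counts = Counter(seq)
    return sum(counts[r] for r in set(residues.upper())) / len(seq)


if __name__ == "__main__":
    # The two fragments shown on the slide.
    for fragment in ("manskysktdvqqvkrqnqqsasgqgqygtef", "gsetdaqqvrkqnqsaeqnkqqns"):
        print(fragment, round(residue_fraction(fragment), 2))
```

A high fraction suggests that many hits from a sensitive search may reflect composition rather than homology, which motivates bringing in a second source of evidence.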
31. Use a hypersensitive sequence search (+), and expand results into a 2nd dimension (-), combined with taxonomic information, to pinpoint a first estimate of the gene’s appearance. J. Bacteriology 2011
32. R3: A prototypical method for improved structure calculation.
38. VENN: Fine grained analysis of SH3 bound peptides reveals a similar interface for divergent sequences. Are the peptides similar to ?
Left to right …
 1AZG (Human FYN)    PRPLPVAP   LYYGDWIPSNY
 1AVZ (Human FYN)    TPQVPL     YD … GDWPSNY
 1PRL (Chicken FYN)  APPLPR     YD ... WPNY
 1H3H (Mouse GRB2)   SRSTK      ENPSWWTLPANY
39. Solution: Use a hypersensitive sequence search, and expand results into a 2nd dimension. Combined with taxonomic information, this pinpoints a first estimate of the gene’s appearance.
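One simple reading of "a first estimate of the gene’s appearance" is the deepest taxonomic node shared by every organism in which a hit was found. The sketch below computes that node as the longest common prefix of root-first lineages. The function name and the example lineages are hypothetical, and the method actually used in the cited work may differ.

```python
def last_common_ancestor(lineages):
    """Return the longest shared root-first prefix of the given lineages.

    Each lineage is a list of taxon names ordered from root to leaf.
    An empty input yields an empty list.
    """
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so the prefix never overruns one.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


if __name__ == "__main__":
    # Illustrative lineages for organisms with a hit (not real search results).
    hits = [
        ["cellular organisms", "Bacteria", "Proteobacteria"],
        ["cellular organisms", "Bacteria", "Firmicutes"],
    ]
    print(last_common_ancestor(hits))
```

Because a hypersensitive search admits false positives, a single spurious hit in a distant lineage can push this estimate toward the root, so it is best treated as a starting point.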
40. Gene Duplication, Domain Reuse, Functional Motifs, and Variance of Structural Specificity
 - "Twilight Zone" homologies
 - Structural Interfaces
 - Binding Specificity
 - Short Functional Motifs
Vertebrates appear to have arranged pre-existing components into a richer collection of domain architectures. Nature 2001
41. Doolittle
 * Functional Protein Bioinformatics - CDD, MnM, Modular evolution of Proteins
 * Database Normalization - "Archival" -> low S/N; unrepresentative
 * Protein-centric sequence searching - Rous Sarcoma Discovery (DNA, lost in translation)
 ***** All done before modern computing/database theory.
42. The Modern Age
 GenBank - archival
 NCBI / EBI - sequence data curation
 PDB/BMRB - structural data curation, deposition
 GO - functional annotations
43. What is data modelling?
 - Ambiguity vs. Vagueness
 - "Text" vs "Syntax"
 - Biological Data: No clear "reference object".
Solution: CONTEXT
45. When To Federate?
 * New Genomes... Draft sequences.
 * Reproducibility is less important than insight.
46. Stark et al. Control of the G2/M Transition 2006
47. Problem: There are hundreds of native peptides containing subsequences that are predicted to have SH3 binding properties. For example, [KR]..[KR] and P..P are known to interact with SH3 domains. But there is no method for comparing the structural binding mechanisms behind these variant peptides. Such a method is necessary because there are hundreds of SH3 domains in the human genome, with several diverse structures in the Protein Data Bank, which cannot be collectively analyzed by eye.
Solution: Use the VENN program for homology titration to extract molecular interfaces from SH3 bound peptides.
 1) For each atom “a1” in each peptide chain of a structure: for each atom “a2” in a DIFFERENT chain of the same structure: is “a1” close to “a2”? If yes, store a1, a2. If no, keep going.
 2) Now create a “synthetic structure” that contains only the residues associated with atoms stored in step (1) and ignores covalent peptide bonds entirely. This structure represents a molecular interface, where all non-interacting residues are considered “extraneous noise”.
 3) To test the biological relevance of the molecular interface, apply it to varying species: is the same signature generated from different structures?
Conclusion: Although the W/P/N/Y residues in SH3 domains are far apart and variably spaced in sequence distance, they may have evolved to possess a common feature: conformance to a highly specific molecular interface. Mouse GRB2 and Human FYN are completely different domains, in different species, which bind different peptides …. Yet surprisingly, their binding sites conform to the same interface. Venn is available at http://sbtools.uchc.edu/venn.
Results: Left to right …
 1AZG (Human FYN)    PRPLPVAP binds LYYGDWIPSNY
 1AVZ (Human FYN)    TPQVPL binds YD … GDWPSNY
 1PRL (Chicken FYN)  APPLPR binds YD ... WPNY
 1H3H (Mouse GRB2)   SRSTK binds ENPSWWTLPANY
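Steps (1) and (2) of the Solution above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the VENN implementation: the `Atom` record, the function name `extract_interface`, and the 4.0 Å default for "close" are all hypothetical choices, and the all-pairs loop is the naive version of the search.

```python
from collections import namedtuple
from math import dist

# Hypothetical minimal atom record: chain ID, residue number, residue name, (x, y, z).
Atom = namedtuple("Atom", "chain resnum resname xyz")


def extract_interface(atoms, peptide_chains, cutoff=4.0):
    """Step 1: collect (a1, a2) pairs where a1 is in a peptide chain, a2 is in
    a different chain, and the two atoms lie within `cutoff` of each other.
    Step 2: reduce the pairs to the set of residues they touch, ignoring
    covalent connectivity. The 4.0 cutoff is an assumed value.
    """
    peptide_chains = set(peptide_chains)
    pairs = []
    for a1 in atoms:
        if a1.chain not in peptide_chains:
            continue
        for a2 in atoms:
            if a2.chain != a1.chain and dist(a1.xyz, a2.xyz) <= cutoff:
                pairs.append((a1, a2))
    residues = sorted({(a.chain, a.resnum, a.resname) for pair in pairs for a in pair})
    return pairs, residues


if __name__ == "__main__":
    # Toy coordinates, invented for illustration only.
    toy = [
        Atom("P", 1, "PRO", (0.0, 0.0, 0.0)),
        Atom("P", 2, "ARG", (10.0, 0.0, 0.0)),
        Atom("A", 50, "TRP", (3.0, 0.0, 0.0)),
        Atom("A", 51, "TYR", (30.0, 0.0, 0.0)),
    ]
    print(extract_interface(toy, peptide_chains=["P"])[1])
```

The nested loop is quadratic in the number of atoms, which is acceptable for a single peptide complex; a spatial index would be the usual replacement when scanning many structures.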
Data is the evidence of our measurements, essentially useless except for bookkeeping. Information is data that is meaningful: it has relationships and context. Knowledge is readily useful factual models and descriptions, like the model of CDC2’s role in the G2/M transition. The reason we do computational biology is that the vast number of proteins and networks in the cell cannot possibly be “held” in the mind of a human being: 20,000 proteins, easily 10,000 in a liver cell, with concentrations of up to 1 million per cell. Promiscuity and regulation, which determine cell fate and physiology, cannot be readily analyzed by any one person. Thus we have the “proteome”: the collection of relationships, sequences, and structures that allows us to classify and draw general conclusions about the specific networks of protein-driven processes in the cell…
1700s, Linnaeus: classification of life forms is more important than just tallying them up. 1980s, Doolittle: archiving proteins is not of value unless we classify them in a non-redundant manner that is consistent with how proteins evolved, via duplication. Doolittle, interested in gene duplication and not a computer guy, built NEWAT as a new version of Margaret Dayhoff’s Atlas and encouraged people to use it for protein-centric sequence searching…. He was able to find that different proteins in different organisms shared common features on a grand scale.
Where are we now? We now know that Doolittle was right: the human genome is highly modular, with one of the highest enrichments of multidomain proteins of any organism. Maybe by integrating information, we can transfer information between proteins more efficiently and effectively, thus decreasing the gap between sequence data and sequence knowledge….