SlideShare ist ein Scribd-Unternehmen logo
1 von 17
Downloaden Sie, um offline zu lesen
How am I supposed to organize a
 protein database when I can't
even organize my address book?
                         Jeremy Yang
                          UNM & IU



    CINF Flash session - ACS National Meeting, March 25, 2012 – San Diego, CA
Alternate title
 (and take home message):

Cheminformatics is so great!

  But is it too good to be
   (transferably) true?

                               2 / 17
How great is cheminformatics?

Example: Are these the same or different molecules?




                                                      3 / 17
How great is cheminformatics?

Example: Are these the same or different molecules?




 Answer: Same, that’s easy, just use canonical graph algorithm
 via canonical SMILES:
     CNC1C(O)C(O)C(CO)OC1OC2C(OC(C)C2(O)C=O)OC7C(O)C(O)C(NC(=N)NCNC(=O)C4=C(O)C(C3CC6C(=C(O)C3(O)C4=O)C(=O)c5c(O)cccc5C6(C)O)N(C)C)C(O)C7NC(N)=N

                                  (TETRACYCLINOMETHYLSTREPTOMYCIN)
                                                                                                                                                   4 / 17
Thanks to…




?                ?

                     5 / 17
Thanks to…




Harry Morgan        Dave Weininger
Actor, “MASH”          Daylight
 (Hmmm…?)              (SMILES)
                                         / 17
                                     6
Thanks to…




  Harry Morgan                         Dave Weininger
     ACS CAS                              Daylight
(Morgan Algorithm)                        (SMILES)
                     Et al., et al….                    7 / 17
Now about those proteins…
  • Example: Are these the same or different proteins?

1YIN:
ALSLTADQMVSALLDAEPPILYSEYDPTRPFSEASMMGLLTNLADRELVHMINWAKRVPGFVDLTLHDQVHLLECAWLEI
LMIGLVWRSMEHPGKLLFAPNLLLDRNQGKCVEGMVEIFDMLLATSSRFRMMNLQGEEFVCLKSIILLNSGVYTFLSSTL
KSLEEKDHIHRVLDKITDTLIHLMAKAGLTLQQQHQRLAQLLLILSHIRHMSNKGMEHLYSMKCKNVVPLYDLLLEMLDA
HRLHAPTS




3OS8:
SNAKRSKKNSLALSLTADQMVSALLDAEPPILYSEYDPTRPFSEASMMGLLTNLADRELVHMINWAKRVPGFVDLTRHDQ
VHLLECAWLEILMIGLVWRSMEHPGKLLFAPNLLLDRNQGKCVEGMVEIFDMLLATSSRFRMMNLQGEEFVCLKSIILLN
SGVYTFLSSTLKSLEEKDHIHRVLDKITDTLIHLMAKAGLTLQQQHQRLAQLLLILSHIRHMSNKGMEHLYSMKCKNVVP
SYDLLLEMLDAHRLHAPT




  PAM250 alignment score:
  (gap: -3; extend: -10)

         1156 (1156/1260 = 92%)
                                                                             Ergo, um… Maybe.
                                                                                                8 / 17
Now about those proteins…
• Example: Are these the same or different proteins?




Answer: Same… but what does that even mean?



                                                       9 / 17
Why protein identification is hard
• Proteins are large, complex, dynamic
• PDB is database of crystallography
  experiments, not molecules
• Ligands, co-crystals, waters
• Protein crystallography & NMR is hard
• History, culture…



                                          10 / 17
How about human identification?
    (Should be easier, may shed light…)




                                          11 / 17
Human identification hard too,
        apparently…
                               Credit card fraud
Homeland security




                     http://forms.cybersource.com/forms/NAFRDQ12012whi
                                                               12 / 17
                     tepaperFraudReport2012CYBSwww2012
(Which brings us to…)
My address book problems




       How many Rob Yangs?




                             13 / 17
(Philosophical tangent:)
Are human entities actually identifiable?




   One Harry Morgan or two?
      How can we know?




                                              14 / 17
(Philosophical tangent:)
Are human entities actually identifiable?



                                     Individuality may be contextual.
   One Harry Morgan or two?
      How can we know?




                                                                        15 / 17
Could I organize my address book
    using cheminformatics?

   What would the algorithm look like?




                                         16 / 17
Conclusions
• “CINF” (cheminformatics) is awesome.
• But some CINF-awesomeness is not readily
  transferable to other domains.
• Cannot automate logic if not logical (How
  many Harry Morgans?).
• Perhaps CINF-awesomeness can be used as an
  indexing approach for chem-related domains.


              “Chester”
                                            17 / 17

Weitere ähnliche Inhalte

Mehr von Jeremy Yang

Badapple: promiscuity patterns from noisy evidence (poster)
Badapple: promiscuity patterns from noisy evidence (poster)Badapple: promiscuity patterns from noisy evidence (poster)
Badapple: promiscuity patterns from noisy evidence (poster)Jeremy Yang
 
Bibliological data science and drug discovery
Bibliological data science and drug discoveryBibliological data science and drug discovery
Bibliological data science and drug discoveryJeremy Yang
 
BioMISS: Language Diversity of Computing
BioMISS: Language Diversity of ComputingBioMISS: Language Diversity of Computing
BioMISS: Language Diversity of ComputingJeremy Yang
 
The Language Diversity of Computing
The Language Diversity of ComputingThe Language Diversity of Computing
The Language Diversity of ComputingJeremy Yang
 
RMSD: routine measure stirs doubts
RMSD: routine measure stirs doubtsRMSD: routine measure stirs doubts
RMSD: routine measure stirs doubtsJeremy Yang
 
Canonicalized systematic nomenclature in cheminformatics
Canonicalized systematic nomenclature in cheminformaticsCanonicalized systematic nomenclature in cheminformatics
Canonicalized systematic nomenclature in cheminformaticsJeremy Yang
 
Molecular scaffolds poster
Molecular scaffolds posterMolecular scaffolds poster
Molecular scaffolds posterJeremy Yang
 
Molecular scaffolds are special and useful guides to discovery
Molecular scaffolds are special and useful guides to discoveryMolecular scaffolds are special and useful guides to discovery
Molecular scaffolds are special and useful guides to discoveryJeremy Yang
 
The BADAPPLE promiscuity plugin for BARD
The BADAPPLE promiscuity plugin for BARDThe BADAPPLE promiscuity plugin for BARD
The BADAPPLE promiscuity plugin for BARDJeremy Yang
 
Cheminformatics Software Development: Case Studies
Cheminformatics Software Development: Case StudiesCheminformatics Software Development: Case Studies
Cheminformatics Software Development: Case StudiesJeremy Yang
 
UNM Division of Biocomputing public web applications
UNM Division of Biocomputing public web applicationsUNM Division of Biocomputing public web applications
UNM Division of Biocomputing public web applicationsJeremy Yang
 
Cyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in BiocomputingCyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in BiocomputingJeremy Yang
 
Promiscuous patterns and perils in PubChem and the MLSCN
Promiscuous patterns and perils in PubChem and the MLSCNPromiscuous patterns and perils in PubChem and the MLSCN
Promiscuous patterns and perils in PubChem and the MLSCNJeremy Yang
 

Mehr von Jeremy Yang (13)

Badapple: promiscuity patterns from noisy evidence (poster)
Badapple: promiscuity patterns from noisy evidence (poster)Badapple: promiscuity patterns from noisy evidence (poster)
Badapple: promiscuity patterns from noisy evidence (poster)
 
Bibliological data science and drug discovery
Bibliological data science and drug discoveryBibliological data science and drug discovery
Bibliological data science and drug discovery
 
BioMISS: Language Diversity of Computing
BioMISS: Language Diversity of ComputingBioMISS: Language Diversity of Computing
BioMISS: Language Diversity of Computing
 
The Language Diversity of Computing
The Language Diversity of ComputingThe Language Diversity of Computing
The Language Diversity of Computing
 
RMSD: routine measure stirs doubts
RMSD: routine measure stirs doubtsRMSD: routine measure stirs doubts
RMSD: routine measure stirs doubts
 
Canonicalized systematic nomenclature in cheminformatics
Canonicalized systematic nomenclature in cheminformaticsCanonicalized systematic nomenclature in cheminformatics
Canonicalized systematic nomenclature in cheminformatics
 
Molecular scaffolds poster
Molecular scaffolds posterMolecular scaffolds poster
Molecular scaffolds poster
 
Molecular scaffolds are special and useful guides to discovery
Molecular scaffolds are special and useful guides to discoveryMolecular scaffolds are special and useful guides to discovery
Molecular scaffolds are special and useful guides to discovery
 
The BADAPPLE promiscuity plugin for BARD
The BADAPPLE promiscuity plugin for BARDThe BADAPPLE promiscuity plugin for BARD
The BADAPPLE promiscuity plugin for BARD
 
Cheminformatics Software Development: Case Studies
Cheminformatics Software Development: Case StudiesCheminformatics Software Development: Case Studies
Cheminformatics Software Development: Case Studies
 
UNM Division of Biocomputing public web applications
UNM Division of Biocomputing public web applicationsUNM Division of Biocomputing public web applications
UNM Division of Biocomputing public web applications
 
Cyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in BiocomputingCyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in Biocomputing
 
Promiscuous patterns and perils in PubChem and the MLSCN
Promiscuous patterns and perils in PubChem and the MLSCNPromiscuous patterns and perils in PubChem and the MLSCN
Promiscuous patterns and perils in PubChem and the MLSCN
 

How am I supposed to organize a protein database when I can't even organize my address book?

  • 1. How am I supposed to organize a protein database when I can't even organize my address book? Jeremy Yang UNM & IU CINF Flash session - ACS National Meeting, March 25, 2012 – San Diego, CA
  • 2. Alternate title (and take home message): Cheminformatics is so great! But is it too good to be (transferably) true? 2 / 17
  • 3. How great is cheminformatics? Example: Are these the same or different molecules? 3 / 17
  • 4. How great is cheminformatics? Example: Are these the same or different molecules? Answer: Same, that’s easy, just use canonical graph algorithm via canonical SMILES: CNC1C(O)C(O)C(CO)OC1OC2C(OC(C)C2(O)C=O)OC7C(O)C(O)C(NC(=N)NCNC(=O)C4=C(O)C(C3CC6C(=C(O)C3(O)C4=O)C(=O)c5c(O)cccc5C6(C)O)N(C)C)C(O)C7NC(N)=N (TETRACYCLINOMETHYLSTREPTOMYCIN) 4 / 17
  • 5. Thanks to… ? ? 5 / 17
  • 6. Thanks to… Harry Morgan Dave Weininger Actor, “MASH” Daylight (Hmmm…?) (SMILES) / 17 6
  • 7. Thanks to… Harry Morgan Dave Weininger ACS CAS Daylight (Morgan Algorithm) (SMILES) Et al., et al…. 7 / 17
  • 8. Now about those proteins… • Example: Are these the same or different proteins? 1YIN: ALSLTADQMVSALLDAEPPILYSEYDPTRPFSEASMMGLLTNLADRELVHMINWAKRVPGFVDLTLHDQVHLLECAWLEI LMIGLVWRSMEHPGKLLFAPNLLLDRNQGKCVEGMVEIFDMLLATSSRFRMMNLQGEEFVCLKSIILLNSGVYTFLSSTL KSLEEKDHIHRVLDKITDTLIHLMAKAGLTLQQQHQRLAQLLLILSHIRHMSNKGMEHLYSMKCKNVVPLYDLLLEMLDA HRLHAPTS 3OS8: SNAKRSKKNSLALSLTADQMVSALLDAEPPILYSEYDPTRPFSEASMMGLLTNLADRELVHMINWAKRVPGFVDLTRHDQ VHLLECAWLEILMIGLVWRSMEHPGKLLFAPNLLLDRNQGKCVEGMVEIFDMLLATSSRFRMMNLQGEEFVCLKSIILLN SGVYTFLSSTLKSLEEKDHIHRVLDKITDTLIHLMAKAGLTLQQQHQRLAQLLLILSHIRHMSNKGMEHLYSMKCKNVVP SYDLLLEMLDAHRLHAPT PAM250 alignment score: (gap: -3; extend: -10) 1156 (1156/1260 = 92%) Ergo, um… Maybe. 8 / 17
  • 9. Now about those proteins… • Example: Are these the same or different proteins? Answer: Same… but what does that even mean? 9 / 17
  • 10. Why protein identification is hard • Proteins are large, complex, dynamic • PDB is database of crystallography experiments, not molecules • Ligands, co-crystals, waters • Protein crystallography & NMR is hard • History, culture… 10 / 17
  • 11. How about human identification? (Should be easier, may shed light…) 11 / 17
  • 12. Human identification hard too, apparently… Credit card fraud Homeland security http://forms.cybersource.com/forms/NAFRDQ12012whi 12 / 17 tepaperFraudReport2012CYBSwww2012
  • 13. (Which brings us to…) My address book problems How many Rob Yangs? 13 / 17
  • 14. (Philosophical tangent:) Are human entities actually identifiable? One Harry Morgan or two? How can we know? 14 / 17
  • 15. (Philosophical tangent:) Are human entities actually identifiable? Individuality may be contextual. One Harry Morgan or two? How can we know? 15 / 17
  • 16. Could I organize my address book using cheminformatics? What would the algorithm look like? 16 / 17
  • 17. Conclusions • “CINF” (cheminformatics) is awesome. • But some CINF-awesomeness is not readily transferable to other domains. • Cannot automate logic if not logical (How many Harry Morgans?). • Perhaps CINF-awesomeness can be used as an indexing approach for chem-related domains. “Chester” 17 / 17