MINE Databases for 
Metabolite Repair: 
A Workshop 
James Jeffryes 
10/16/14 
1
REPRESENTING 
CHEMICALS 
DIGITALLY 
2
Overview
• What constraints influence how we represent compounds digitally?
• A few common chemical data structures
• Canonicalization & Hashing
• Fingerprinting and Similarity measures
3
A central struggle in Computer Science
Computational Efficiency vs. Memory Efficiency
• Should hydrogen atoms be specified?
• How to represent resonance?
• How to provide material properties?
4
Another Tradeoff 
Human Readability vs. Computational Utility
SMILES: OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O
InChIKey: WQZGKKKJIJFFOK-GASJEMHNSA-N
5
Computers ❤️ Graphs 
• Graphs have nodes and edges
• So do molecules!
• These nodes may have spatial positions
• Hydrogen atoms can really get in the way!
[Figure: ethanol drawn with all six hydrogen atoms shown]
6
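To make the graph view concrete, here is a minimal sketch in plain Python. It stores only the heavy atoms of ethanol and recovers the hydrogen counts from typical valences. The names `VALENCE`, `adjacency`, and `implicit_hydrogens` are illustrative, and the valence table is a simplifying assumption that ignores charges and unusual valence states.

```python
# Heavy-atom graph for ethanol (CH3-CH2-OH); hydrogens are left implicit.
atoms = {0: "C", 1: "C", 2: "O"}      # node index -> element symbol
bonds = [(0, 1, 1), (1, 2, 1)]        # (atom, atom, bond order)

# Typical valences for neutral atoms (assumption: no charges or radicals).
VALENCE = {"C": 4, "N": 3, "O": 2}


def adjacency(atoms, bonds):
    """Return {atom index: [(neighbor index, bond order), ...]}."""
    adj = {i: [] for i in atoms}
    for a, b, order in bonds:
        adj[a].append((b, order))
        adj[b].append((a, order))
    return adj


def implicit_hydrogens(atoms, bonds):
    """Hydrogens needed to fill each atom's typical valence."""
    adj = adjacency(atoms, bonds)
    return {
        i: VALENCE[element] - sum(order for _, order in adj[i])
        for i, element in atoms.items()
    }


h_counts = implicit_hydrogens(atoms, bonds)
print(h_counts)  # {0: 3, 1: 2, 2: 1}
```

Three nodes and two edges are enough to describe ethanol; the six hydrogens can be regenerated whenever they are needed.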
Encoding graphs 
• Three ways with increasing subtlety (more CPU, less memory):
– Matrices
– Lists
– Strings
[Figure: ethanol skeleton, C-C-O]
7
Bond Electron Matrix
  C C O
C 0 1 0
C 1 0 1
O 0 1 0
[Figure: ethanol skeleton, C-C-O]
A symmetric matrix whose values correspond to the bond order between two atoms
Not as space efficient, but very easy to manipulate computationally
8
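The matrix on this slide can be built from a bond list in a few lines. This is a sketch in plain Python; `bond_matrix` and `neighbors` are illustrative names, and the diagonal is left at zero as on the slide.

```python
symbols = ["C", "C", "O"]            # ethanol heavy atoms, in matrix order
bonds = [(0, 1, 1), (1, 2, 1)]       # (row atom, column atom, bond order)


def bond_matrix(n_atoms, bonds):
    """Build a symmetric n x n matrix of bond orders."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b, order in bonds:
        matrix[a][b] = order
        matrix[b][a] = order         # mirror the entry to keep the matrix symmetric
    return matrix


def neighbors(matrix, i):
    """Indices of the atoms bonded to atom i (non-zero entries in row i)."""
    return [j for j, order in enumerate(matrix[i]) if order]


matrix = bond_matrix(len(symbols), bonds)
for symbol, row in zip(symbols, matrix):
    print(symbol, *row)
```

Looking up whether two atoms are bonded is a single index operation, which is the "easy to manipulate" part. The cost is n² stored values, most of them zero for a typical molecule.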
Chemical Markup Language (CML) 
is a list notation 
<cml><MDocument><MChemicalStruct> 
<molecule molID="m1"> 
<atomArray> 
<atom id="a1" elementType="C" x2="-4.208333333333333" y2="1.4583333333333333"/> 
<atom id="a2" elementType="C" x2="-2.801473328563728" y2="2.0847077636700657"/> 
<atom id="a3" elementType="O" x2="-2.325587157226309" y2="3.549334798764602"/> 
</atomArray> 
<bondArray> 
<bond atomRefs2="a1 a2" order="1"/> 
<bond atomRefs2="a2 a3" order="1"/> 
</bondArray> 
</molecule> 
</MChemicalStruct></MDocument></cml> 
[Figure: ethanol skeleton, C-C-O]
9
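Because CML is XML, a standard XML parser can read the atom and bond lists. The sketch below uses Python's built-in `xml.etree.ElementTree` on the snippet from this slide. It assumes the elements carry no XML namespace, as in the snippet; CML files from other tools may declare one and would need namespace-aware lookups.

```python
import xml.etree.ElementTree as ET

CML = """<cml><MDocument><MChemicalStruct>
<molecule molID="m1">
<atomArray>
<atom id="a1" elementType="C" x2="-4.208333333333333" y2="1.4583333333333333"/>
<atom id="a2" elementType="C" x2="-2.801473328563728" y2="2.0847077636700657"/>
<atom id="a3" elementType="O" x2="-2.325587157226309" y2="3.549334798764602"/>
</atomArray>
<bondArray>
<bond atomRefs2="a1 a2" order="1"/>
<bond atomRefs2="a2 a3" order="1"/>
</bondArray>
</molecule>
</MChemicalStruct></MDocument></cml>"""

root = ET.fromstring(CML)

# Atom list: id -> element symbol
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}

# Bond list: (first atom id, second atom id, bond order)
bonds = [
    (*bond.get("atomRefs2").split(), int(bond.get("order")))
    for bond in root.iter("bond")
]

print(atoms)  # {'a1': 'C', 'a2': 'C', 'a3': 'O'}
print(bonds)  # [('a1', 'a2', 1), ('a2', 'a3', 1)]
```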
Mol files are also lists
3 2 0 0 0 0 0 0 0 0999 V2000
22.1200 -15.8397 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
20.9088 -16.5419 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
23.3312 -16.5419 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0
1 3 1 0 0 0
• Atom lines: coordinates (x, y, z), then atom type
• Bond lines: from atom, to atom, bond order
[Figure: ethanol skeleton, C-C-O]
SDF files are lists of molfiles and properties (Listception!)
10
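The same atom and bond lists can be read back from the mol block on this slide. The sketch below makes two assumptions: the block starts at the counts line (a complete molfile has three header lines before it), and the fields can be split on whitespace. The V2000 format is actually fixed-width, so a robust reader would slice columns instead. `parse_mol_block` is an illustrative name.

```python
MOL_BLOCK = """\
3 2 0 0 0 0 0 0 0 0999 V2000
22.1200 -15.8397 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
20.9088 -16.5419 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
23.3312 -16.5419 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0
1 3 1 0 0 0
"""


def parse_mol_block(text):
    """Return (atoms, bonds) from a block that starts at the counts line."""
    lines = text.strip().splitlines()
    # Counts line: the first two numbers are the atom count and the bond count.
    n_atoms, n_bonds = (int(value) for value in lines[0].split()[:2])

    atoms = []
    for line in lines[1:1 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for line in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        start, end, order = (int(value) for value in line.split()[:3])
        bonds.append((start, end, order))   # atom numbers are 1-based
    return atoms, bonds


atoms, bonds = parse_mol_block(MOL_BLOCK)
print([symbol for symbol, *_ in atoms])  # ['C', 'C', 'O']
print(bonds)                             # [(1, 2, 1), (1, 3, 1)]
```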
Simplified molecular-input line-entry system (SMILES)
CCO … That’s it!
[Figure: ethanol skeleton, C-C-O]
How about something a bit more tricky?
O=Cc1ccc(O)c(OC)c1 (vanillin)
11
Try writing SMILES for Ethambutol 
• CCC(CO)NCCNC(CC)CO 
• What about: 
– OCC(CC)NCCNC(CO)CC 
– CCC(NCCNC(CO)CC)CO 
– And many more 
• What if we want to know whether two compounds are the same?
12
Atoms that aren’t literal atoms
• R group – matches any group of atoms
• Query Atoms
– A – Matches any atom but hydrogen
– Q – Matches any atom but hydrogen or carbon
– M – Matches any metal
– X – Matches any halogen
– Atom lists – Match any of a specified set of elements
• Pseudoatoms – an atom not on the periodic table. Computers just treat them as text
13
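The query atoms above amount to small predicates on an element symbol. A minimal sketch follows; the function name `matches` and the `HALOGENS` set are illustrative. The `M` (any metal) query is deliberately left out because it needs a full table of metals.

```python
HALOGENS = {"F", "Cl", "Br", "I", "At"}   # assumption: tennessine is ignored


def matches(query, element):
    """Return True if `element` satisfies the query atom.

    `query` is "A", "Q", "X", or a set of element symbols (an atom list).
    """
    if query == "A":                      # any atom but hydrogen
        return element != "H"
    if query == "Q":                      # any atom but hydrogen or carbon
        return element not in {"H", "C"}
    if query == "X":                      # any halogen
        return element in HALOGENS
    if isinstance(query, str):
        raise ValueError(f"unsupported query atom: {query!r}")
    return element in query               # atom list


print(matches("A", "C"), matches("Q", "C"), matches("X", "Cl"))  # True False True
print(matches({"N", "O"}, "S"))                                  # False
```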
Canonicalization
[O-]C(=O)c1cc(O)cc(c1)O
Establish a canonical form of the graph (can be tricky!):
• Dominant tautomer (resonance)
• Predominant chemical species (charge)
Enumerate the graph in a predictable way:
• Picking the starting atom
• Selecting which branch to follow at branch points
SMILES can be canonical; InChIs always are
14
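The "enumerate in a predictable way" step can be sketched for a deliberately narrow case: acyclic molecules written with only C, N, and O atoms, single bonds, and branches. Under those assumptions the three Ethambutol SMILES from the earlier slide reduce to the same string. This is an illustration, not the algorithm used by any SMILES or InChI toolkit; it ignores rings, charges, stereochemistry, and tautomers, and its output is a canonical label rather than a conventional SMILES. `parse_simple_smiles`, `encode`, and `canonical` are illustrative names.

```python
def parse_simple_smiles(smiles):
    """Parse SMILES limited to C/N/O atoms, single bonds, and branches."""
    atoms, adj = [], []
    stack, prev = [], None
    for ch in smiles:
        if ch == "(":
            stack.append(prev)            # remember the branch point
        elif ch == ")":
            prev = stack.pop()            # return to the branch point
        elif ch in "CNO":
            index = len(atoms)
            atoms.append(ch)
            adj.append([])
            if prev is not None:
                adj[prev].append(index)
                adj[index].append(prev)
            prev = index
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, adj


def encode(atoms, adj, node, parent):
    """Encode the subtree below `node`, visiting branches in sorted order."""
    branches = sorted(
        encode(atoms, adj, nxt, node) for nxt in adj[node] if nxt != parent
    )
    return atoms[node] + "".join(f"({branch})" for branch in branches)


def canonical(smiles):
    """Try every atom as the starting atom and keep the smallest encoding."""
    atoms, adj = parse_simple_smiles(smiles)
    return min(encode(atoms, adj, start, None) for start in range(len(atoms)))


ethambutol = ["CCC(CO)NCCNC(CC)CO", "OCC(CC)NCCNC(CO)CC", "CCC(NCCNC(CO)CC)CO"]
labels = {canonical(s) for s in ethambutol}
print(len(labels))                                # 1
print(canonical("CCCO") == canonical("CC(C)O"))   # False: different compounds
```

Sorting the branches fixes the choice at every branch point, and taking the minimum over all starting atoms fixes the starting atom, which are the two decisions listed on the slide.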
Identifying molecules
• Even a string representation can be a cumbersome way to refer to molecules
• For example, phospholipids:
– InChI=1S/C81H148O17P2/c1-5-9-13-17-21-25-29-33-37-41-45-49-53-57-61-65-78(83)91-71-76(97-80(85)67-63-59-55-51-47-43-39-35-31-27-23-19-15-11-7-3)73-95-99(87,88)93-69-75(82)70-94-100(89,90)96-74-77(98-81(86)68-64-60-56-52-48-44-40-36-32-28-24-20-16-12-8-4)72-92-79(84)66-62-58-54-50-46-42-38-34-30-26-22-18-14-10-6-2/h23,27,33-40,75-77,82H,5-22,24-26,28-32,41-74H2,1-4H3,(H,87,88)(H,89,90)/b27-23-,37-33-,38-34-,39-35-,40-36-/t75?,76-,77-/m1/s1
• What we need is an automatic, short name for this compound
15
Hashing to the rescue 
• We want a function that is:
– Deterministic (always gives the same output for the same input)
– Fixed Length (usually)
– Uniform (makes good use of the space we allow it)
• A fixed-length output cannot map 1:1 onto every possible input, so collisions can happen (but are very unlikely)
• Example InChIKey:
– HGIKPGJCIWRORL-TVFZIFOYSA-N
– First block: connectivity; second block: stereo etc.; final letter: protonation
16
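The properties listed on this slide can be demonstrated with Python's `hashlib`. The sketch below is not the InChIKey algorithm: `short_label` is a hypothetical helper that turns any string into a 14-letter label from a SHA-256 digest, and `split_inchikey` simply separates a key into the three parts labelled on the slide.

```python
import hashlib


def short_label(text, length=14):
    """Deterministic, fixed-length, letters-only label (illustration only)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Map each byte onto A-Z. The modulo adds a slight bias, which is fine
    # for a demonstration but not for a real identifier scheme.
    return "".join(chr(ord("A") + byte % 26) for byte in digest[:length])


def split_inchikey(key):
    """Split an InChIKey into the three parts labelled on the slide."""
    connectivity, stereo_etc, protonation = key.split("-")
    return {
        "connectivity": connectivity,
        "stereo_etc": stereo_etc,
        "protonation": protonation,
    }


label_a = short_label("CCO")
label_b = short_label("CCO")
print(label_a == label_b, len(label_a))   # True 14

parts = split_inchikey("HGIKPGJCIWRORL-TVFZIFOYSA-N")
print(parts["connectivity"], parts["protonation"])   # HGIKPGJCIWRORL N
```

A hash only sees characters, so `short_label("CCO")` and `short_label("OCC")` differ even though both strings describe ethanol. This is why the structure has to be canonicalized before it is hashed.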
Fragment-based Chemical Fingerprints
~400 chemical moieties, each of which is either present or absent
Used extensively in pharmaceutical science
17
Atom Pair Chemical Fingerprints 
• Encode all atoms as a type 
– -OH = 14 
– -CH2- = 3 
– -CH3 = 1 
• Enumerate all distances between pairs 
– 14 – (2) – 3 
– 3 – (2) – 1 
– 14 – (3) – 1 
• Hash the result 
[Figure: ethanol skeleton, C-C-O]
18
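The three steps on this slide can be sketched for ethanol as follows. The type codes (1, 3, 14) are taken from the slide. The distance is counted as the number of atoms on the shortest path, which reproduces the slide's values (bonded atoms are at distance 2). The folding step with `zlib.crc32` into 64 bits is an assumption chosen for illustration; real implementations use their own hashing and vector sizes.

```python
import zlib
from collections import deque
from itertools import combinations

atom_types = {0: 1, 1: 3, 2: 14}          # -CH3 = 1, -CH2- = 3, -OH = 14
adj = {0: [1], 1: [0, 2], 2: [1]}          # ethanol heavy-atom graph


def path_lengths(adj, start):
    """Breadth-first search: atoms on the shortest path from `start`."""
    dist = {start: 1}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def atom_pairs(atom_types, adj):
    """All (smaller type, distance, larger type) triples in the molecule."""
    pairs = []
    for a, b in combinations(sorted(adj), 2):
        distance = path_lengths(adj, a)[b]
        low, high = sorted((atom_types[a], atom_types[b]))
        pairs.append((low, distance, high))
    return sorted(pairs)


def fold(pairs, n_bits=64):
    """Hash each pair to a bit position in a fixed-length vector."""
    bits = [0] * n_bits
    for pair in pairs:
        bits[zlib.crc32(repr(pair).encode("utf-8")) % n_bits] = 1
    return bits


pairs = atom_pairs(atom_types, adj)
print(pairs)   # [(1, 2, 3), (1, 3, 14), (3, 2, 14)]
fingerprint = fold(pairs)
```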
Your Turn! 
• Find the unique atom types 
and count unique atom pairs 
– 5 unique atom types 
• -CH3, -CH2-, -CH<, -OH, -NH- 
– ~23 unique atom pairs 
19
Quantitative Chemical Similarity
Tanimoto Coefficient
(no similarity) 0 ≤ τ ≤ 1 (identical vectors)
We can quantitatively describe chemical similarity by computation.
[Figure: two molecule drawings, each with an example fingerprint vector]
[ 0 1 0 0 1 ]
[ 0 1 0 1 1 ]
τ = 0.2
20
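For bit vectors, the Tanimoto coefficient is the number of positions set in both vectors divided by the number set in at least one. A minimal sketch follows; the function name and the choice to return 0.0 for two all-zero vectors are assumptions. Applied to the two five-bit vectors printed on the slide it gives 2/3. The slide's τ = 0.2 is therefore presumably computed on something else, such as the full fingerprints of the pictured molecules, and the short vectors are only illustrative excerpts.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length bit vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)      # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)     # bits set in at least one
    # Convention chosen here: two empty fingerprints score 0.0.
    return both / either if either else 0.0


first = [0, 1, 0, 0, 1]
second = [0, 1, 0, 1, 1]
score = tanimoto(first, second)
print(round(score, 3))   # 0.667
```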
QUESTIONS? 
21

Representing Chemicals Digitally: An overview of Cheminformatics
