This document discusses various ways that computers represent chemical structures digitally for storage and analysis. It covers challenges like specifying hydrogen atoms and resonance structures, as well as tradeoffs between human readability and computational utility. Common representation methods include graphs, bond matrices, Chemical Markup Language (CML), SMILES strings, and InChI identifiers. Fingerprinting techniques like atom pairs are used to quantify chemical similarity between structures.
3. Overview
• What constraints influence how we represent compounds digitally?
• A few common chemical data structures
• Canonicalization & Hashing
• Fingerprinting and Similarity measures
4. A central struggle in Computer Science
Should hydrogen atoms be specified?
How to represent resonance?
How to provide material properties?
Computational Efficiency vs. Memory Efficiency
5. Another Tradeoff
Human Readability vs. Computational Utility
SMILES: OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O
InChIKey: WQZGKKKJIJFFOK-GASJEMHNSA-N
6. Computers ❤️ Graphs
• Graphs have nodes and edges
• So do molecules!
• These nodes may have spatial positions
• Hydrogen atoms can really get in the way!
[Figure: ethanol drawn with every hydrogen atom explicit (2 C, 1 O, 6 H)]
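One common way to keep hydrogens out of the way is to store only the heavy-atom graph and infer the hydrogen counts when needed. The sketch below illustrates that idea for ethanol. The valence table and the function name `implicit_hydrogens` are assumptions made for illustration, and the approach only holds for neutral atoms with their usual valences.

```python
# Usual valences for a few neutral atoms (an assumption of this sketch).
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}

# Ethanol as a heavy-atom graph: nodes are atoms, edges are (i, j, bond order).
ATOMS = ["C", "C", "O"]
BONDS = [(0, 1, 1), (1, 2, 1)]


def implicit_hydrogens(atoms, bonds):
    """Return the number of hydrogens implied on each heavy atom."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return [DEFAULT_VALENCE[symbol] - u for symbol, u in zip(atoms, used)]


hydrogen_counts = implicit_hydrogens(ATOMS, BONDS)
print(hydrogen_counts)  # [3, 2, 1] -> CH3, CH2, OH
```

Storing three nodes instead of nine is the memory saving; recomputing the hydrogen counts is the extra CPU work.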
7. Encoding graphs
• Three ways with increasing subtlety (more CPU, less memory):
– Matrices
– Lists
– Strings
[Figure: ethanol skeleton, C-C-O]
8. Bond Electron Matrix
    C  C  O
C   0  1  0
C   1  0  1
O   0  1  0
[Figure: ethanol skeleton, C-C-O]
A symmetric matrix whose values correspond to the bond order between two atoms
Not as space efficient but very easy to manipulate computationally
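As a minimal sketch of why the matrix form is easy to manipulate, the ethanol matrix from this slide can be stored as a nested list and queried with plain indexing. The helper names (`is_symmetric`, `neighbors`, `bond_list`) are illustrative and not from any particular toolkit.

```python
# Bond-order matrix for the ethanol skeleton, in the atom order C, C, O.
ATOMS = ["C", "C", "O"]
BOND_ORDERS = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]


def is_symmetric(matrix):
    """A bond between i and j is the same bond between j and i."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


def neighbors(matrix, i):
    """Indices of the atoms bonded to atom i."""
    return [j for j, order in enumerate(matrix[i]) if order > 0]


def bond_list(matrix):
    """Collapse the matrix to (i, j, order) triples, one per bond."""
    n = len(matrix)
    return [(i, j, matrix[i][j]) for i in range(n) for j in range(i + 1, n) if matrix[i][j]]


print(is_symmetric(BOND_ORDERS))   # True
print(neighbors(BOND_ORDERS, 1))   # [0, 2]
print(bond_list(BOND_ORDERS))      # [(0, 1, 1), (1, 2, 1)]
```

The matrix stores n × n entries even though this molecule has only two bonds, which is the space cost the slide mentions.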
9. Chemical Markup Language (CML) is a list notation
<cml><MDocument><MChemicalStruct>
<molecule molID="m1">
<atomArray>
<atom id="a1" elementType="C" x2="-4.208333333333333" y2="1.4583333333333333"/>
<atom id="a2" elementType="C" x2="-2.801473328563728" y2="2.0847077636700657"/>
<atom id="a3" elementType="O" x2="-2.325587157226309" y2="3.549334798764602"/>
</atomArray>
<bondArray>
<bond atomRefs2="a1 a2" order="1"/>
<bond atomRefs2="a2 a3" order="1"/>
</bondArray>
</molecule>
</MChemicalStruct></MDocument></cml>
[Figure: ethanol skeleton, C-C-O]
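Because CML is XML, the atom and bond lists can be read back with a general-purpose XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on the document from this slide, with plain ASCII quotes throughout. The names `CML_TEXT` and `read_cml` are illustrative, and the code assumes the document declares no XML namespace, as is the case here.

```python
import xml.etree.ElementTree as ET

CML_TEXT = """<cml><MDocument><MChemicalStruct>
<molecule molID="m1">
<atomArray>
<atom id="a1" elementType="C" x2="-4.208333333333333" y2="1.4583333333333333"/>
<atom id="a2" elementType="C" x2="-2.801473328563728" y2="2.0847077636700657"/>
<atom id="a3" elementType="O" x2="-2.325587157226309" y2="3.549334798764602"/>
</atomArray>
<bondArray>
<bond atomRefs2="a1 a2" order="1"/>
<bond atomRefs2="a2 a3" order="1"/>
</bondArray>
</molecule>
</MChemicalStruct></MDocument></cml>"""


def read_cml(text):
    """Return ({atom id: (element, x, y)}, [(atom id, atom id, order)])."""
    root = ET.fromstring(text)
    atoms = {}
    for atom in root.iter("atom"):
        atoms[atom.get("id")] = (
            atom.get("elementType"),
            float(atom.get("x2")),
            float(atom.get("y2")),
        )
    bonds = []
    for bond in root.iter("bond"):
        first, second = bond.get("atomRefs2").split()
        bonds.append((first, second, int(bond.get("order"))))
    return atoms, bonds


cml_atoms, cml_bonds = read_cml(CML_TEXT)
print([element for element, _, _ in cml_atoms.values()])  # ['C', 'C', 'O']
print(cml_bonds)  # [('a1', 'a2', 1), ('a2', 'a3', 1)]
```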
10. Mol files are also lists
3 2 0 0 0 0 0 0 0 0999 V2000
22.1200 -15.8397 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
20.9088 -16.5419 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
23.3312 -16.5419 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0
1 3 1 0 0 0
[Figure: ethanol skeleton, C-C-O, annotating the atom block columns (Coordinates, Type) and the bond block columns (From, To, Order)]
SDF files are lists of molfiles and properties (Listception!)
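The counts line, atom block and bond block shown on this slide can be read with a few lines of code. This is a simplified sketch: real V2000 molfiles use fixed-width columns and have three header lines before the counts line, whereas this version assumes the text starts at the counts line and splits on whitespace. `MOL_BLOCK` and `parse_mol_block` are illustrative names.

```python
MOL_BLOCK = """\
3 2 0 0 0 0 0 0 0 0999 V2000
22.1200 -15.8397 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
20.9088 -16.5419 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
23.3312 -16.5419 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0
1 3 1 0 0 0
"""


def parse_mol_block(text):
    """Return (atoms, bonds) from a counts line plus atom and bond blocks."""
    lines = text.strip().splitlines()
    counts = lines[0].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for line in lines[1:1 + n_atoms]:
        x, y, z, symbol = line.split()[:4]          # Coordinates, Type
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for line in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        start, end, order = (int(v) for v in line.split()[:3])  # From, To, Order
        bonds.append((start, end, order))
    return atoms, bonds


mol_atoms, mol_bonds = parse_mol_block(MOL_BLOCK)
print([symbol for symbol, *_ in mol_atoms])  # ['C', 'C', 'O']
print(mol_bonds)                             # [(1, 2, 1), (1, 3, 1)]
```

Atom indices in the bond block are 1-based, so bond `1 3` joins the first carbon to the oxygen.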
12. Try writing SMILES for Ethambutol
• CCC(CO)NCCNC(CC)CO
• What about:
– OCC(CC)NCCNC(CO)CC
– CCC(NCCNC(CO)CC)CO
– And many more
• What if we want to know if 2 compounds are the same?
13. Atoms that aren’t literal atoms
• R group – matches any group of atoms
• Query Atoms
– A – Matches any atom but hydrogen
– Q – Matches any atom but hydrogen or carbon
– M – Matches any metal
– X – Matches any halogen
– Atom lists – Match any of a specified set of elements
• Pseudoatoms – atoms not on the periodic table. Computers just treat them as text
14. Canonicalization
[O-]C(=O)c1cc(O)cc(c1)O
Establish a canonical form of the graph (can be tricky!):
• Dominant tautomer (resonance)
• Predominant chemical species (charge)
Enumerate the graph in a predictable way:
• Picking the starting atom
• Selecting which branch to follow at branch points
SMILES can be canonical; InChIs always are
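The two enumeration choices (starting atom, branch order) can be illustrated on a deliberately small case. The sketch below handles only acyclic molecules written with single-letter atoms, single bonds and parentheses, such as the ethambutol strings from the earlier slide. It starts at the center of the tree and sorts branches by their text encoding. This is a textbook tree-canonicalization idea, not the algorithm used by canonical SMILES or InChI software, and all function names are illustrative. Rings, charges, stereochemistry and tautomers are out of scope.

```python
def parse_acyclic_smiles(smiles):
    """Parse single-letter atoms, implicit single bonds and ( ) branches."""
    atoms, adjacency, stack, previous = [], [], [], None
    for ch in smiles:
        if ch == "(":
            stack.append(previous)
        elif ch == ")":
            previous = stack.pop()
        elif ch.isalpha() and ch.isupper():
            atoms.append(ch)
            adjacency.append([])
            index = len(atoms) - 1
            if previous is not None:
                adjacency[previous].append(index)
                adjacency[index].append(previous)
            previous = index
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, adjacency


def tree_centers(adjacency):
    """Pick the starting atom(s): peel leaves until 1 or 2 atoms remain."""
    degree = [len(nbrs) for nbrs in adjacency]
    leaves = [i for i, d in enumerate(degree) if d <= 1]
    remaining = len(adjacency)
    while remaining > 2:
        remaining -= len(leaves)
        next_leaves = []
        for leaf in leaves:
            for nbr in adjacency[leaf]:
                degree[nbr] -= 1
                if degree[nbr] == 1:
                    next_leaves.append(nbr)
        leaves = next_leaves
    return leaves


def encode(atoms, adjacency, node, parent):
    """Encode a subtree, visiting branches in sorted (predictable) order."""
    branches = sorted(
        encode(atoms, adjacency, nbr, node)
        for nbr in adjacency[node] if nbr != parent
    )
    return atoms[node] + "".join(f"({b})" for b in branches)


def canonical_form(smiles):
    atoms, adjacency = parse_acyclic_smiles(smiles)
    return min(encode(atoms, adjacency, c, None) for c in tree_centers(adjacency))


variants = ["CCC(CO)NCCNC(CC)CO", "OCC(CC)NCCNC(CO)CC", "CCC(NCCNC(CO)CC)CO"]
print(len({canonical_form(s) for s in variants}))    # 1: all three are the same graph
print(canonical_form("CCO") == canonical_form("COC"))  # False: ethanol vs dimethyl ether
```

The output string is not itself valid SMILES; it is only a label that is identical whenever the underlying trees are identical.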
15. Identifying molecules
• Even a string representation can be a cumbersome way to refer to molecules
• For example, phospholipids:
– InChI=1S/C81H148O17P2/c1-5-9-13-17-21-25-29-33-37-41-45-49-53-57-61-65-78(83)91-71-76(97-80(85)67-63-59-55-51-47-43-39-35-31-27-23-19-15-11-7-3)73-95-99(87,88)93-69-75(82)70-94-100(89,90)96-74-77(98-81(86)68-64-60-56-52-48-44-40-36-32-28-24-20-16-12-8-4)72-92-79(84)66-62-58-54-50-46-42-38-34-30-26-22-18-14-10-6-2/h23,27,33-40,75-77,82H,5-22,24-26,28-32,41-74H2,1-4H3,(H,87,88)(H,89,90)/b27-23-,37-33-,38-34-,39-35-,40-36-/t75?,76-,77-/m1/s1
• What we need is an automatic name for this compound
16. Hashing to the rescue
• We want a function that is:
– Deterministic (always gives the same output for the same input)
– Fixed Length (usually)
– Uniform (makes good use of the space we allow it)
• There is no way to guarantee a 1:1 mapping; collisions can happen (but are very unlikely)
• Example InChIKey:
– HGIKPGJCIWRORL-TVFZIFOYSA-N
– First block (HGIKPGJCIWRORL): Connectivity
– Second block (TVFZIFOYSA): Stereo etc.
– Final letter (N): Protonation
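The three properties above can be demonstrated with any cryptographic hash. The sketch below maps an arbitrary identifier string to 14 uppercase letters using SHA-256 from Python's `hashlib`. The real InChIKey is also derived from SHA-256, but with its own truncation and letter-encoding rules, so `short_key` is a hypothetical helper whose output will not match real InChIKeys.

```python
import hashlib


def short_key(identifier, length=14):
    """Map any string to a fixed-length key of uppercase letters."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big")
    letters = []
    for _ in range(length):
        value, remainder = divmod(value, 26)   # peel off base-26 digits
        letters.append(chr(ord("A") + remainder))
    return "".join(letters)


key_ethanol = short_key("CCO")
key_ether = short_key("COC")
print(len(key_ethanol), key_ethanol == short_key("CCO"))  # 14 True
print(key_ethanol == key_ether)  # False, barring a collision
```

With 14 letters there are 26^14 possible keys, far fewer than the number of possible input strings, which is why a 1:1 mapping cannot be guaranteed.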
17. Fragment based Chemical Fingerprints
~400 chemical moieties which are either present or absent
Used extensively in Pharmaceutical Science
18. Atom Pair Chemical Fingerprints
• Encode all atoms as a type
– -OH = 14
– -CH2- = 3
– -CH3 = 1
• Enumerate all distances between pairs
– 14 – (2) – 3
– 3 – (2) – 1
– 14 – (3) – 1
• Hash the result
[Figure: ethanol skeleton, C-C-O]
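The three steps can be sketched for ethanol using the type codes from this slide. Distances here count the atoms on the shortest path including both ends, which is the convention that reproduces the slide's values (neighbors are at distance 2). The final folding step uses `zlib.crc32` purely as an example hash. The function names and the 64-bit fingerprint width are assumptions for illustration.

```python
import zlib
from collections import deque
from itertools import combinations

ATOM_TYPES = [1, 3, 14]      # -CH3, -CH2-, -OH (type codes from the slide)
BONDS = [(0, 1), (1, 2)]     # CH3-CH2 and CH2-OH


def bonds_from(start, n_atoms, bonds):
    """Breadth-first search: number of bonds from start to every atom."""
    neighbors = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nbr in neighbors[current]:
            if nbr not in distance:
                distance[nbr] = distance[current] + 1
                queue.append(nbr)
    return distance


def atom_pairs(types, bonds):
    """Return sorted (type, atoms-on-path, type) triples for every atom pair."""
    pairs = []
    for i, j in combinations(range(len(types)), 2):
        path_atoms = bonds_from(i, len(types), bonds)[j] + 1
        low, high = sorted((types[i], types[j]))
        pairs.append((low, path_atoms, high))
    return sorted(pairs)


def fold_to_bits(pairs, n_bits=64):
    """Hash each pair to a bit position in a fixed-width fingerprint."""
    return sorted({zlib.crc32(repr(p).encode("ascii")) % n_bits for p in pairs})


pairs = atom_pairs(ATOM_TYPES, BONDS)
print(pairs)  # [(1, 2, 3), (1, 3, 14), (3, 2, 14)]
bit_positions = fold_to_bits(pairs)
```

Sorting the two types inside each triple makes the pair independent of the order in which the atoms were listed.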
19. Your Turn!
• Find the unique atom types and count unique atom pairs
– 5 unique atom types
• -CH3, -CH2-, -CH<, -OH, -NH-
– ~23 unique atom pairs
20. Quantitative Chemical Similarity
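We can quantitatively describe chemical similarity by computation.
Tanimoto Coefficient
(no similarity) 0 ≤ τ ≤ 1 (identical vectors)
[Figure: two compounds, each shown with an example fingerprint vector]
[ 0 1 0 0 1 ]
[ 0 1 0 1 1 ]
τ = 0.2 (value shown on the slide for the pictured compounds; the two five-bit example vectors taken alone give 2/3)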
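For binary fingerprints the Tanimoto coefficient is usually written as τ = c / (a + b − c), where a and b are the numbers of set bits in each vector and c is the number of bits set in both. The sketch below applies that formula to the two five-bit example vectors only, so its result need not match a value computed from the full fingerprints of the pictured compounds. The function name and the handling of two all-zero vectors are choices made for this sketch.

```python
def tanimoto(first, second):
    """Tanimoto coefficient of two equal-length binary fingerprints."""
    if len(first) != len(second):
        raise ValueError("fingerprints must have the same length")
    shared = sum(1 for x, y in zip(first, second) if x and y)   # c
    either = sum(1 for x, y in zip(first, second) if x or y)    # a + b - c
    if either == 0:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return shared / either


fp_a = [0, 1, 0, 0, 1]
fp_b = [0, 1, 0, 1, 1]
similarity = tanimoto(fp_a, fp_b)
print(round(similarity, 3))  # 0.667 (2 shared bits out of 3 set in either)
```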