1. Molecular similarity searching methods in drug discovery
By: Haytham Hijazi
Advisor: Univ-Prof. Hon-Prof. Dr. Dieter Roller
A presentation in the Advanced Graphical Engineering Systems seminar, 2011/2012
2. In this work, I propose a contribution to the field of cheminformatics.
Cheminformatics means solving chemical problems using computational methods [1].
[1] James Rhodes, Stephen Boyer, Jeffrey Kreulen, Ying Chen, Patricia Ordonez, “Mining patents using molecular similarity search”, IBM, Almaden Services Research, Pacific Symposium on Biocomputing 12:304-315 (2007).
3. Agenda
• The main question in this research
• The principle of similarity
• Drug discovery as an application
• Research problem
• Molecular representations (1D, 2D, …)
• Similarity searching
• Similarity coefficient calculations
• The probabilistic model (BIM)
• The contribution (MDC)
• Experiments, conclusions and discussion
4. “The similarity is in the eye of the beholder”
Shape, colour, size, pattern
5. Question: Which molecules in a database are similar to the query molecule?
Applications:
• Finding better compounds than the initial lead compound (drug discovery)
• Property prediction for unknown compounds
6. Structurally similar molecules are assumed to have similar biological properties [1].
Similar biological properties → drug discovery.
[1] Sylvaine Roy and Laurence Lafanechère, “Chemogenomics and Chemical Genetics: A User's Introduction for Biologists, Chemists and Informaticians”, Molecular similarity, Springer Berlin, ISBN 978-3-642-19614-0, 1st Edition.
8. Molecule representation → Feature selection → Similarity coefficient calculations and ranking for search
9. Historical progression
◦ Complete structure
◦ Sub-structure
Descriptors
◦ 1D (physicochemical properties), 2D, 3D, and 4D
Connectivity tables and graph theory!
Image source: Karine Audouze, “Representation of molecular structures and structural diversity”, ChemoInformatics in Drug Discovery, 2009.
10. SMILES
CCCC1=NN(C2=C1NC(=NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)C)OCC)C
CC(=O)OC1=CC=CC=C1C(=O)O
SMILES – Simplified Molecular Input Line Entry System
Source: Karine Audouze, “Representation of molecular structures and structural diversity”, ChemoInformatics in Drug Discovery, 2009.
11. A fingerprint is a vector encoding the presence (‘1’) or absence (‘0’) of fragment substructures in a molecule.
Fingerprints are either dictionary-based or hash-based.
Descriptor  Fragment
1           AR
2           CCCCN
3           Me
9           NH2
Source [2]: Karine Audouze, “Representation of molecular structures and structural diversity”, ChemoInformatics in Drug Discovery, 2009.
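The two fingerprint styles can be illustrated with a minimal sketch. It assumes the fragments of a molecule have already been detected by some earlier step; the fragment dictionary, the function names, and the 16-bit length are hypothetical choices for illustration, not taken from the cited source.

```python
import zlib

# Hypothetical fragment dictionary: list position = bit position.
FRAGMENT_DICTIONARY = ["AR", "CCCCN", "Me", "NH2"]


def dictionary_fingerprint(fragments, dictionary=FRAGMENT_DICTIONARY):
    """Dictionary-based: one bit per known fragment, 1 if it is present."""
    present = set(fragments)
    return [1 if frag in present else 0 for frag in dictionary]


def hashed_fingerprint(fragments, n_bits=16):
    """Hash-based: fold arbitrary fragments into a fixed-length bit vector."""
    bits = [0] * n_bits
    for frag in fragments:
        # crc32 is deterministic across runs, unlike the built-in str hash.
        bits[zlib.crc32(frag.encode("utf-8")) % n_bits] = 1
    return bits


# Assumed output of a fragment-detection step for one molecule.
molecule_fragments = ["AR", "NH2", "OH"]

print(dictionary_fingerprint(molecule_fragments))  # [1, 0, 0, 1]
print(hashed_fingerprint(molecule_fragments))
```

In this sketch the dictionary version silently ignores "OH" because it is not in the dictionary, while the hashed version encodes every fragment but may map two fragments to the same bit.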
12. In 3D keys, the position of each bit corresponds to a certain range of distances or angles.
Computationally complex.
Source: Karine Audouze, “Representation of molecular structures and structural diversity”, ChemoInformatics in Drug Discovery, 2009.
13. Molecule representation → Feature selection → Similarity coefficient calculations and ranking for search
15. The similarity measure (coefficient) is a quantitative measure of similarity.
It is used to rank the results of the query.
Results are sorted in decreasing order of similarity.
Types of coefficients:
• Distance coefficients
• Probabilistic coefficients
• Correlation coefficients
• Association coefficients
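Ranking a database against a query can be sketched as follows. The fingerprints are represented here as sets of on-bit positions, and a set-overlap score (shared bits divided by all bits) stands in for whichever coefficient is chosen; the molecule names and bit positions are made up for illustration.

```python
def overlap_score(query_bits, candidate_bits):
    """Shared on-bits divided by all on-bits (0.0 if both sets are empty)."""
    union = query_bits | candidate_bits
    return len(query_bits & candidate_bits) / len(union) if union else 0.0


def rank_by_similarity(query_bits, database, similarity=overlap_score):
    """Return (score, name) pairs sorted by decreasing similarity."""
    scored = [(similarity(query_bits, bits), name) for name, bits in database.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical database: molecule name -> set of on-bit positions.
database = {"mol_1": {0, 3, 5}, "mol_2": {0, 1, 2, 3}, "mol_3": {7, 8}}
query = {0, 1, 2, 3}

for score, name in rank_by_similarity(query, database):
    print(name, round(score, 2))
```

Passing the coefficient in as a parameter keeps the ranking step independent of which similarity measure is used.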
16. Notation: a = bits set in A, b = bits set in B, c = bits set in both A and B, d = bits set in neither.
Association coefficients
Simple matching: (c + d) / (a + b − c + d)
Jaccard (Tanimoto): c / (a + b − c) = |A AND B| / |A OR B|
Cosine (Ochiai): c / √(a·b)
Dice: c / (0.5·(a + b)) = 2c / (a + b)
Distance coefficients
Hamming distance: a + b − 2c
Euclidean distance: √(a + b − 2c)
Soergel distance: (a + b − 2c) / (a + b − c)
Other coefficients
Pattern difference: (a − c)(b − c) / (a + b − c + d)²
Size difference: (a − b)² / (a + b − c + d)²
Source: Naomie Salim, “The study of probability model for compound similarity searching”, UTM Research Management Centre Project Vote – 75207, University of Malaysia, 2009
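A few of these coefficients can be written directly as functions of the bit counts. The sketch below assumes that a and b are the numbers of bits set in each fingerprint, c the number set in both, and d the number set in neither; the function names and the example counts are illustrative only.

```python
import math


def simple_matching(a, b, c, d):
    """Matching positions (both on or both off) over all positions."""
    return (c + d) / (a + b - c + d)


def tanimoto(a, b, c):
    """Shared on-bits over the union of on-bits."""
    return c / (a + b - c)


def cosine(a, b, c):
    return c / math.sqrt(a * b)


def dice(a, b, c):
    return 2 * c / (a + b)


def hamming(a, b, c):
    """Number of positions where the two fingerprints differ."""
    return a + b - 2 * c


def soergel(a, b, c):
    """Distance counterpart of Tanimoto: equals 1 - tanimoto(a, b, c)."""
    return (a + b - 2 * c) / (a + b - c)


# Made-up counts for two 10-bit fingerprints.
a, b, c, d = 4, 3, 2, 5
print(tanimoto(a, b, c), dice(a, b, c), hamming(a, b, c), soergel(a, b, c))
```

Note that the association coefficients grow with similarity, while the distance coefficients shrink, so a ranking by distance has to be sorted in increasing order.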
17. Assume we generate fragment-based fingerprint bits:
Molecule A: 00010100010101000101010011110100
Molecule B: 00000000100101001001000011100000
Tanimoto coefficient = c / (a + b − c)
where a and b are the numbers of bits set in A and B, and c is the number of bits set in A AND B.
Tanimoto = 6 / (13 + 8 − 6) = 0.4
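The counts in this example can be checked mechanically. The short sketch below reads the two bit strings from the slide, counts the bits set in each and in both, and applies the Tanimoto formula; the helper name is hypothetical.

```python
FP_A = "00010100010101000101010011110100"
FP_B = "00000000100101001001000011100000"


def bit_counts(fp_a, fp_b):
    """Return (a, b, c): bits set in A, bits set in B, bits set in both."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = fp_a.count("1")
    b = fp_b.count("1")
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == "1" and y == "1")
    return a, b, c


a, b, c = bit_counts(FP_A, FP_B)
print(a, b, c)            # 13 8 6
print(c / (a + b - c))    # 0.4
```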
18. Associate the relevance of a structure with an explicit feature.
pi = probability that bit bi appears in an active structure.
qi = probability that bit bi appears in an inactive structure.
αi is a binary selector: αi = 1 if the bit occurs in the structure S; otherwise αi = 0 and the complementary probability (1 − pi or 1 − qi) is used.
P(A|S) is the probability that a structure is active given S.
P(NA|S) is the probability that a structure is inactive given S.
P(A) is the prior probability of actives.
P(NA) is the prior probability of inactives.
Source: Naomie Salim, “The study of probability model for compound similarity searching”, UTM Research Management Centre Project Vote – 75207, University of Malaysia, 2009
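The scoring formula itself is not reproduced in the slide text. As a rough sketch, the code below uses the usual binary independence weighting known from information retrieval, in which each bit that is set contributes log[pi(1 − qi) / (qi(1 − pi))] to the score; this is an assumption and may differ from the exact variant in the cited report. The probabilities in the example are made up and must lie strictly between 0 and 1.

```python
import math


def bim_score(alpha, p, q):
    """Sum of log-odds weights over the bits that are set (alpha_i = 1).

    alpha: 0/1 selectors for the structure's bits
    p:     per-bit probability of appearing in an active structure
    q:     per-bit probability of appearing in an inactive structure
    """
    score = 0.0
    for alpha_i, p_i, q_i in zip(alpha, p, q):
        if alpha_i:
            score += math.log(p_i * (1 - q_i) / (q_i * (1 - p_i)))
    return score


# Hypothetical estimates for a 3-bit fingerprint.
p = [0.8, 0.5, 0.3]
q = [0.2, 0.5, 0.6]
print(bim_score([1, 1, 0], p, q))
```

A bit that is equally likely in actives and inactives (pi = qi) gets weight zero, so only discriminating bits move a structure up or down the ranking.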
20. [Diagram: molecular dynamics simulation tool; database; active compounds; physicochemical properties; classification algorithm; classes 1, 2, …, n; voting]
21. Better insight into similarity in terms of bioactivity, toxicity, reactivity, ... (+)
Search time (+)
Prediction and voting possibilities (+)
Cost of simulation tools (-)
Classification errors (-)
22. Materials Explorer
Itemtracker - Freezer/Cryogen sample tracking system
CHARMM
MDynaMix
23. Fingerprint generation time
[Chart: fingerprint generation time in ms (0 to 30) versus maximum path length (4 to 8), with series for 2, 3 and 4 bits]
Consider if we have more than 1000 bits!
Data source: simulation tool indicated in the report [17]
24. Hit rate
[Chart: hit rate (0 to 0.18) versus selection size (0 to 2500)]
As the selection size increases, the hit rate of finding actives decreases.
Data source: simulation tool indicated in the report [17]
25. Even fragment-based fingerprint generation is time-consuming.
Probabilistic models and machine learning introduced substantial changes.
Mixing more than one type of descriptor seems efficient, i.e. in time and result quality.
Experimental results are still needed.
26. Thanks for listening
Haytham Hijazi
Editor's Notes
1
Each bit in the fingerprint represents one molecular fragment