Molecular similarity searching methods, seminar

Molecular similarity searching methods in drug discovery
By: Haytham Hijazi
Advisor: Univ-Prof. Hon-Prof. Dr. Dieter Roller

A presentation in the advanced graphical engineering systems seminar 2011/2012

1

In this work, I propose a contribution to the field of cheminformatics.
Cheminformatics means solving chemical problems using computational methods [1].

1. James Rhodes, Stephen Boyer, Jeffrey Kreulen, Ying Chen, Patricia Ordonez, “Mining patents using molecular similarity search”, IBM, Almaden Services Research, Pacific Symposium on Biocomputing 12:304-315 (2007).

2

Agenda
•The main question in this research

•The principle of similarity

•Drug discovery as an application

•Research problem

• Molecular representations (1D, 2D…)

•Searching the similarity

•Similarity coefficients calculations

•The probabilistic model (BIM)

•The contribution (MDC)

•Experiments, conclusions and discussion
3

“The similarity is in the eye of the beholder”
Shape, colour, size, pattern

4

Question: Which molecules in a database are similar to the query molecule?

Application:
• Finding better compounds than the initial lead compound (drug discovery)
• Property prediction for unknown compounds

5

• Structurally similar molecules are assumed to have similar biological properties.

• Similar biological properties → drug discovery.

[1]

1. Sylvaine Roy and Laurence Lafanechère, “Chemogenomics and Chemical Genetics: A User's Introduction for Biologists, Chemists and Informaticians”, Molecular similarity, Springer Berlin, ISBN 978-3-642-19614-0, 1st Edition.
6

Claim: General manufacturing problems!
7

Molecule representation → Feature selection → Similarity coefficient calculations and ranking for search

8

• Historical progression
◦ Complete structure
◦ Substructure

• Descriptors
◦ 1D (physicochemical properties), 2D, 3D, and 4D

• Connectivity tables and graph theory!

Image source: Karine Audouze, “Representation of molecular structures and structural diversity”, ChemoInformatics in Drug Discovery, 2009.
9

SMILES

CCCC1=NN(C2=C1NC(=NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)C)OCC)C
CC(=O)OC1=CC=CC=C1C(=O)O

SMILES – Simplified Molecular Input Line Entry System
Source: Karine Audouze, “Representation of molecular structures and structural diversity”, ChemoInformatics in Drug Discovery, 2009.
10

• A fingerprint is a vector encoding the presence (‘1’) or absence (‘0’) of fragment substructures in a molecule

• Fingerprints are either dictionary-based or hash-based

Descriptor   Fragment
1            AR
2            CCCCN
3            Me
9            NH2
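
A dictionary-based fingerprint can be sketched as follows. This is only an illustration: the fragment labels are the four listed on this slide, while the function name `dictionary_fingerprint`, the bit ordering, and the idea of passing the molecule as an already detected set of fragment labels are assumptions. A real system would find the fragments by substructure matching.

```python
# Hypothetical fragment dictionary: descriptor number -> fragment label.
# Only the four entries shown on the slide are included.
FRAGMENT_DICTIONARY = {1: "AR", 2: "CCCCN", 3: "Me", 9: "NH2"}


def dictionary_fingerprint(fragments_present, dictionary=FRAGMENT_DICTIONARY):
    """Return a list of 0/1 bits, one per dictionary entry, ordered by descriptor number."""
    return [
        1 if dictionary[key] in fragments_present else 0
        for key in sorted(dictionary)
    ]


# Assumed input: fragment labels already detected in some molecule.
example_bits = dictionary_fingerprint({"AR", "NH2"})
print(example_bits)
```

Each bit has a fixed meaning in this scheme, which is what distinguishes it from a hash-based fingerprint, where several fragments may map to the same bit.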

[1]
[2]

2. Source: Karine Audouze, “Representation of molecular structures and structural diversity”, ChemoInformatics in Drug Discovery, 2009.
11

• In 3D keys, the position of each bit corresponds to a certain range of distances or angles.
• Computationally complex
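
The idea of one bit per distance range can be sketched as below. This is a simplified illustration, not the scheme used by any particular tool: the bin edges, the function name `distance_key`, and the sample coordinates are all assumptions, and real 3D keys usually encode distances between specific feature types rather than between all atoms.

```python
import math
from itertools import combinations

# Assumed distance ranges in angstroms; bit i covers [edges[i], edges[i + 1]).
BIN_EDGES = [0.0, 2.0, 4.0, 6.0, 8.0]


def distance_key(coordinates, edges=BIN_EDGES):
    """Set bit i if any pair of points lies at a distance inside range i."""
    bits = [0] * (len(edges) - 1)
    for p, q in combinations(coordinates, 2):
        d = math.dist(p, q)
        for i in range(len(bits)):
            if edges[i] <= d < edges[i + 1]:
                bits[i] = 1
    return bits


# Three hypothetical atom positions (not taken from the slides).
points = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 5.0, 0.0)]
print(distance_key(points))
```

The loop visits every pair of points, so the work grows quadratically with the number of atoms for a single conformation, which gives a feel for why 3D keys are costly to compute.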

Source: Karine Audouze, “Representation of molecular structures and structural diversity”, ChemoInformatics in Drug Discovery, 2009.
12

Molecule representation → Feature selection → Similarity coefficient calculations and ranking for search

13

• Structure search
◦ Exact structure search
◦ Substructure search

• Similarity searching: maximal common subgraph isomorphism, Tanimoto/Dice/Cosine coefficients

14

• The similarity measure (coefficient) is a quantitative measure of similarity

• It is used to rank the results of the query

• Results are sorted in decreasing order of similarity

Distance coefficients.
Probabilistic coefficients.
Correlation coefficients.
Association coefficients.
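
The ranking step can be sketched as follows. The Tanimoto coefficient stands in for any association coefficient here; the function names, the representation of a fingerprint as a set of 'on' bit positions, and the toy database are assumptions made for illustration.

```python
def rank_by_similarity(query, database, similarity):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, similarity(query, fp)) for name, fp in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def tanimoto(x, y):
    """Tanimoto coefficient on sets of 'on' bit positions."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0


# Hypothetical fingerprints, stored as sets of 'on' bit positions.
database = {"mol_1": {1, 2, 3}, "mol_2": {2, 3, 4, 5}, "mol_3": {7, 8}}
ranking = rank_by_similarity({1, 2, 3, 4}, database, tanimoto)
print(ranking)
```

With a distance coefficient the sort order would be reversed, since a smaller distance means a more similar molecule.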

15

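Notation: a = bits set in molecule A, b = bits set in molecule B, c = bits set in both, d = bits set in neither.

Association coefficients
Simple matching coefficient    (c + d) / (a + b − c + d)
Jaccard measure (Tanimoto)     c / (a + b − c) = AND / OR
Cosine (Ochiai)                c / √(a·b)
Dice                           2c / (a + b)

Distance coefficients
Hamming distance               a + b − 2c
Euclidean distance             √(a + b − 2c)
Soergel distance               (a + b − 2c) / (a + b − c)

Other coefficients
Pattern difference             (a − c)(b − c) / (a + b − c + d)²
Size                           (a − b)² / (a + b − c + d)²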

Naomie Salim, “The study of probability model for compound similarity searching”, UTM Research Management Centre Project Vote – 75207, University of Malaysia, 2009
16
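
The coefficients on this slide can be computed from four bit counts, as in the sketch below. It assumes that a and b are the numbers of bits set in molecules A and B, c is the number of bits set in both, and d is the number set in neither, and it uses the standard form c / √(a·b) for the cosine coefficient. The function name and dictionary keys are hypothetical, and at least one bit is assumed to be set in each fingerprint.

```python
import math


def coefficients(a, b, c, d):
    """Compute similarity and distance coefficients from bit counts.

    a, b: bits set in molecules A and B; c: bits set in both;
    d: bits set in neither (assumed notation).
    """
    n = a + b - c + d  # total fingerprint length
    return {
        "simple_matching": (c + d) / n,
        "tanimoto": c / (a + b - c),
        "cosine": c / math.sqrt(a * b),
        "dice": 2 * c / (a + b),
        "hamming": a + b - 2 * c,
        "euclidean": math.sqrt(a + b - 2 * c),
        "soergel": (a + b - 2 * c) / (a + b - c),
        "pattern_difference": (a - c) * (b - c) / n ** 2,
        "size": (a - b) ** 2 / n ** 2,
    }


# Counts for two 32-bit fingerprints with a = 13, b = 8, c = 6,
# which leaves d = 32 - (13 + 8 - 6) = 17 bits set in neither.
values = coefficients(a=13, b=8, c=6, d=17)
for name, value in values.items():
    print(f"{name}: {value:.4f}")
```

Note that the Soergel distance is the complement of the Tanimoto coefficient for binary fingerprints, so the two always sum to 1.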

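• Assume we generate fragment-based fingerprint bits
• Molecule A: 00010100010101000101010011110100
• Molecule B: 00000000100101001001000011100000
• Tanimoto coefficient = c / (a + b − c)
• where a and b are the numbers of bits set in A and B, and c is the number of bits set in both (A AND B)

• Tanimoto = 6 / (13 + 8 − 6) = 6/15 = 0.4

[Venn diagram: sets a and b overlapping in c]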
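
The arithmetic of this example can be checked with a few lines of Python. The two bit strings are the ones given on the slide; the variable names are illustrative.

```python
A = "00010100010101000101010011110100"
B = "00000000100101001001000011100000"

a = A.count("1")  # bits set in A
b = B.count("1")  # bits set in B
# c: positions where both fingerprints have a 1 (bitwise AND)
c = sum(1 for x, y in zip(A, B) if x == "1" and y == "1")

tanimoto = c / (a + b - c)
print(a, b, c, tanimoto)
```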

17

• Associate the relevance of a structure with an explicit feature

• pi = probability that bit bi appears in an active structure
• qi = probability that bit bi appears in an inactive structure
• αi is a binary selector: αi = 1 if the bit occurs in the structure; otherwise αi = 0 and the complementary probability is used
• P(A|S) is the probability that a structure is active given S
• P(NA|S) is the probability that a structure is inactive given S
• P(A) is the prior probability of actives
• P(NA) is the prior probability of inactives
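
The model's formula did not survive on this slide, so the sketch below assumes the standard binary independence form, in which bits are treated as independent and a structure is scored by the log of P(A|S) / P(NA|S). The function names and the example probabilities are hypothetical, and every pi and qi is assumed to lie strictly between 0 and 1.

```python
import math


def bim_log_odds(alpha, p, q, prior_active):
    """Log of P(A|S) / P(NA|S) under the bit-independence assumption.

    alpha: 0/1 bits of structure S; p[i], q[i]: probability that bit i is set
    in active / inactive structures; prior_active: P(A), with P(NA) = 1 - P(A).
    """
    score = math.log(prior_active / (1.0 - prior_active))
    for a_i, p_i, q_i in zip(alpha, p, q):
        if a_i:  # bit present: use p_i and q_i
            score += math.log(p_i / q_i)
        else:  # bit absent: use the complements 1 - p_i and 1 - q_i
            score += math.log((1.0 - p_i) / (1.0 - q_i))
    return score


def probability_active(alpha, p, q, prior_active):
    """Convert the log-odds into P(A|S)."""
    return 1.0 / (1.0 + math.exp(-bim_log_odds(alpha, p, q, prior_active)))


# Hypothetical probabilities for a 3-bit fingerprint (not from the report).
p = [0.8, 0.5, 0.1]
q = [0.2, 0.5, 0.3]
print(probability_active([1, 0, 0], p, q, prior_active=0.5))
```

Ranking a database by this score puts the structures most likely to be active first, which is the probabilistic counterpart of ranking by a similarity coefficient.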

Naomie Salim, “The study of probability model for compound similarity searching”, UTM Research Management Centre Project Vote – 75207, University of Malaysia, 2009
18

Claim: General manufacturing problems !
19

[Diagram: proposed approach. Elements: molecular dynamics simulation tool; active compounds database; physicochemical properties; classification algorithm; voting; Class 1, Class 2, ..., Class n]

20

• Better insight into similarity in terms of bioactivity, toxicity, reactivity... (+)

• The time of searching (+)

• Prediction and voting possibilities (+)

• Cost of simulation tools (-)

• Classification errors (-)

21

• Materials Explorer

• Itemtracker - Freezer/Cryogen sample tracking system

• CHARMM

• MDynaMix

22

[Figure: fingerprint generation time. Y axis: time (ms), 0 to 30. X axis: maximum path length, 4 to 8. Series: 2 bits, 3 bits, 4 bits.]

Consider if we have more than 1000 bits!

Data source: simulating tool indicated in the report [17]
23

[Figure: hit rate versus selection size. Y axis: hit rate, 0 to 0.18. X axis: selection size, 0 to 2500.]

The larger the feature selection size, the lower the hit rate for finding actives.

Data source: simulating tool indicated in the report [17]
24

• Even fragment-based fingerprinting is time-consuming

• Probabilistic models and machine learning introduced substantial changes

• Mixing more than one type of descriptor seems efficient in terms of both time and result quality

• Experimental results are still needed

25

Thank you for listening

Haytham Hijazi

26
