GIW2013

The 24th International Conference on Genome
Informatics (GIW2013), Dec. 16 2013.

Scalable prediction of compound-protein interactions using minwise
hashing
Yasuo Tabei (PRESTO, JST)
Joint work with
Yoshihiro Yamanishi (Kyushu Univ.)

Drug-target interactions
• Most drugs are small molecules that interact with
one or several target proteins
• Analyzing functional interactions between small
compounds and proteins plays an important role
in genomic drug discovery

Genome-wide prediction of unknown
compound-protein interactions

• Yamanishi et al., Bioinformatics (ISMB 2008), 24:i232-i240, 2008.
• Faulon et al., Bioinformatics, 24:225-233, 2008.
• Jacob et al., Bioinformatics, 24:2149-2156, 2008.
• Bleakley et al., Bioinformatics, 25:2397-2403, 2009.

Fingerprints (binary vectors) of
compound and protein

• Compounds represented by PubChem substructures

• Proteins represented by PFAM domains
4,137 elements

Fingerprint representation of
compound-protein pairs
• Tensor product of each compound and protein pair

– All possible products of compound substructures and
PFAM domains

• Observation: fingerprint representation
• Large number of high-dimensional fingerprints:
– Number: 216 million (= 35,366 × 6,111)
– Dimension: 771,756 (= 881 × 876)
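
The tensor-product representation above can be illustrated with a minimal sketch. It assumes fingerprints are stored as sets of active bit positions and that the pair (i, j) is flattened to the index i * 876 + j; the function name and the example bit positions are hypothetical, and only the sizes 881 and 876 come from the slide.

```python
from itertools import product

N_SUBSTRUCTURES = 881  # compound-side bits, from the slide
N_DOMAINS = 876        # protein-side bits, from the slide


def pair_fingerprint(compound_bits, protein_bits, n_domains=N_DOMAINS):
    """Return the active indices of the tensor-product fingerprint.

    compound_bits and protein_bits are sets of active (1-valued) positions.
    Bit (i, j) of the product is 1 only when compound bit i and protein
    bit j are both 1; it is flattened to the index i * n_domains + j
    (the flattening order is an assumption of this sketch).
    """
    return {i * n_domains + j for i, j in product(compound_bits, protein_bits)}


compound = {0, 5, 880}  # hypothetical compound with three active substructures
protein = {2, 875}      # hypothetical protein with two active domains
pair = pair_fingerprint(compound, protein)

print(len(pair))      # 6 active bits (3 * 2)
print(max(pair) + 1)  # 771756, the full dimension 881 * 876
```

Storing only the active indices keeps each pair small even though the nominal dimension is 771,756.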

Existing methods
• Pairwise Kernel SVM [Faulon et al., 2008]
– Kernel matrix of inner products between the fingerprints of
each pair of compound-protein pairs
– Large time complexity and large working space in nc and np
(nc, np: the numbers of compounds and proteins)

• Linear SVM (Ex: LIBLINEAR (Lin et al., 2007))
– Uses the fingerprints of compound-protein pairs as input
– Large training time and working space

• Challenge: developing a scalable method for predicting
large-scale compound-protein interactions

Overview of our method
• Basic idea: build compact fingerprints from
fingerprints of compound-protein pairs
– Leverage an idea behind MinHash (Minwise Hashing)
[Broder et al., 2000]

• Train linear classifiers using compact fingerprints
– Smaller working space for training
– Short training time
– Classification accuracy comparable to previous
methods
– Interpretability of features

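MinHash [Broder et al., 2000]
• Mapping a set S into a string of length l
1. Generate a random permutation π_k
2. Apply the permutation π_k to the set S
3. Compute the minimum of π_k(S) as the k-th integer of the string
4. Iterate steps 1-3 for k = 1, ..., l

• Conserves the Jaccard similarity in the original
space: two sets agree at each position of their strings with
probability equal to their Jaccard similarity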
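
The four MinHash steps can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the authors' C++ implementation: it materializes explicit permutations of a small universe, whereas a practical implementation for 771,756 dimensions would typically use random hash functions instead. All function names are hypothetical.

```python
import random


def minhash_signature(items, universe_size, length, seed=0):
    """Map a set of integers in [0, universe_size) to `length` integers."""
    rng = random.Random(seed)  # same seed -> same permutations for every set
    signature = []
    for _ in range(length):
        # Step 1: draw a random permutation of the universe.
        perm = list(range(universe_size))
        rng.shuffle(perm)
        # Steps 2-3: permute the set and keep the minimum value.
        signature.append(min(perm[x] for x in items))
    return signature


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions at which two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = set(range(0, 60))
b = set(range(30, 90))
sig_a = minhash_signature(a, universe_size=100, length=500)
sig_b = minhash_signature(b, universe_size=100, length=500)

print(jaccard(a, b))                   # exact value, 30 / 90
print(estimate_jaccard(sig_a, sig_b))  # estimate, close to the exact value
```

Both signatures must be built with the same permutations (here, the same seed); otherwise the agreement rate no longer estimates the Jaccard similarity.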

Saving memory by additional hashing
• Drawback of MinHash: many bits are needed for
storing each hashed value
• Reduce the hashed value to a smaller value
– Apply a random hash function h: {1,..,M} → {1,…,N}
(N << M) to each hashed value

• Collision probability is derived as follows:

• J(Si,Sj): Jaccard similarity
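
A standard way to obtain such a collision probability is sketched below. It assumes that h is drawn uniformly at random and independently of the permutation π, so that two different minimum values still collide with probability 1/N; the expression in the original talk may be written in a different form.

```latex
\begin{align*}
\Pr\bigl[h(\min \pi(S_i)) = h(\min \pi(S_j))\bigr]
  &= \Pr\bigl[\min \pi(S_i) = \min \pi(S_j)\bigr] \cdot 1
   + \Pr\bigl[\min \pi(S_i) \neq \min \pi(S_j)\bigr] \cdot \frac{1}{N} \\
  &= J(S_i, S_j) + \frac{1 - J(S_i, S_j)}{N}
\end{align*}
```

Under this assumption the extra term (1 - J)/N shrinks quickly as N grows, so for large N the collision probability stays close to the Jaccard similarity itself.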

Collision probability for various Jaccard
similarities J and additional hash sizes N

Procedure for building compact
fingerprints
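
One plausible reading of this procedure is sketched below: MinHash the pair fingerprint l times, shrink each minimum to one of N slots, and one-hot encode each slot so that a linear classifier can use it. Several choices here are assumptions rather than details from the talk: the hash family x -> (a*x + b) mod P stands in for random permutations, a plain modulo stands in for the random hash function h, and the one-hot expansion follows common practice for minwise hashing with linear learners. All names are hypothetical.

```python
import random

P = 2_147_483_647  # prime 2**31 - 1, larger than any feature index used here


def make_hashes(length, seed=0):
    """Draw `length` random (a, b) pairs for hashes x -> (a*x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(length)]


def compact_fingerprint(pair_bits, hashes, n_buckets):
    """Return active indices of a binary vector of size len(hashes) * n_buckets."""
    active = []
    for k, (a, b) in enumerate(hashes):
        # MinHash step: minimum hashed value over the active bits of the pair.
        m = min((a * x + b) % P for x in pair_bits)
        # Additional hashing (assumed here to be a modulo): keep one of N slots.
        slot = m % n_buckets
        # One-hot encode the slot inside the k-th block of the output vector.
        active.append(k * n_buckets + slot)
    return active


hashes = make_hashes(length=10)  # l = 10
pair_bits = {3, 17, 770_000}     # hypothetical active bits of one pair
fp = compact_fingerprint(pair_bits, hashes, n_buckets=2**16)

print(len(fp))  # 10 active bits, one per block
```

With l = 10 and N = 2^16 the compact vector has 655,360 dimensions but only 10 nonzero entries per pair, which is what keeps the working space small.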

SVM using compact fingerprints
• Use L1- and L2-regularizations to prevent
overfitting
• MH-L1SVM (L1-regularization)

• MH-L2SVM (L2-regularization)

• Use an efficient optimization algorithm named
LIBLINEAR (Lin et al., 2007)
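
For orientation, the two regularized objectives can be written in the general form used by linear SVM solvers. This is a sketch under assumptions: x_i denotes the compact fingerprint of the i-th compound-protein pair, y_i in {-1, +1} its interaction label, C a trade-off parameter, and the loss ξ is left generic (for example, hinge or squared hinge loss) because the talk does not state which one is used.

```latex
\begin{align*}
\text{MH-L1SVM:}\quad & \min_{w}\; \|w\|_1 + C \sum_{i} \xi(w; x_i, y_i) \\
\text{MH-L2SVM:}\quad & \min_{w}\; \tfrac{1}{2}\|w\|_2^2 + C \sum_{i} \xi(w; x_i, y_i)
\end{align*}
```

The L1 penalty tends to drive many weights to exactly zero, which is why it is the natural choice when a small set of extracted features is wanted.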

Other details
• Linear SVM with compact fingerprints simulates
non-linear SVM with pairwise kernels
– A non-linear SVM can be simulated with a linear SVM

• Can extract important features for predicting
compound-protein interactions
– Use reverse hashing functions

• See our paper for more details
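
The connection to pairwise kernels rests on a simple identity: the inner product of two tensor-product fingerprints equals the product of the compound-side and protein-side inner products. The check below is a small illustration with hypothetical bit sets; it is not taken from the talk.

```python
from itertools import product


def tensor(c_bits, p_bits):
    """Active (compound bit, protein bit) pairs of the product fingerprint."""
    return set(product(c_bits, p_bits))


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


c1, p1 = {0, 1, 2, 3}, {10, 11}  # hypothetical compound and protein
c2, p2 = {2, 3, 4}, {11, 12, 13}

t1, t2 = tensor(c1, p1), tensor(c2, p2)

# For binary vectors, the inner product is the size of the intersection.
inner_pair = len(t1 & t2)
inner_factored = len(c1 & c2) * len(p1 & p2)

print(inner_pair, inner_factored)  # both are 2
print(jaccard(t1, t2))             # 2 / (8 + 9 - 2)
```

Because MinHash agreement rates estimate the Jaccard similarity of these product sets, a linear model on the hashed features behaves like a kernel method on the pairs without ever forming the kernel matrix.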

Experiments
• 216 million compound-protein pairs, including
300,202 interacting pairs
– Unbalanced data

• Use AUC score, training time and memory as
evaluation measures
• Compare MH-L1SVM and MH-L2SVM to L1SVM
and L2SVM
– L1SVM and L2SVM: L1- and L2-regularized SVM with fingerprints
computed by tensor products
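
As a reminder of what the AUC score measures, it can be computed as the probability that a randomly chosen interacting pair is scored above a randomly chosen non-interacting pair. The sketch below uses a quadratic-time loop for clarity only; the labels and scores are made up for illustration.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0, 0]            # 1 = interacting, 0 = non-interacting
scores = [0.9, 0.4, 0.5, 0.3, 0.1]  # hypothetical classifier outputs

print(auc_score(labels, scores))  # 5 of 6 pairs ranked correctly
```

Because it is defined over positive-negative pairs, this measure remains informative when non-interacting pairs vastly outnumber interacting ones.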

Two types of 5-fold cross validation

AUC score of MH-L1SVM by varying the
length of hashed strings l
Balanced dataset of 600,404 compound-protein pairs

Training time of MH-L1SVM for varying
the length of string (N=2^16)

Maximum AUC score

Memory usage versus the number of compound-protein pairs (l=10, N=2^16)

AUC score and training time on 216 million
compound-protein pairs
(l=10, N=2^16)

Measure               MH-L1SVM   MH-L2SVM   L1SVM        L2SVM
AUC score             0.79       0.81       -            -
Training time (sec)   15,713     10,054     > 48 hours   > 48 hours

The number of extracted features

Summary
• Scalable prediction of compound-protein
interactions using minwise hashing
• Applicable to 216 million compound-protein pairs
• The same trends in the pair-wise cross
validation experiments can be observed in the
block-wise experiments (See our paper)
• Dataset and C++ implementation:
https://sites.google.com/site/interactminhash/

Figure: the number of extracted features
(y-axis: number of features, 0 to 6,000; x-axis: ratio of negative
samples, log scale base 10, 0.0 to 2.5; series: L1SVM, L1LOG)

AUC score on pair-wise cross validation
experiment (l=10, N=2^16)
(Ratio: the number of non-interacting pairs divided by the number of
interacting pairs; Number: total number of pairs)

Ratio   Number       MH-L1SVM   MH-L2SVM   L1SVM   L2SVM
1       600,404      0.78       0.79       0.79    0.80
5       1,801,212    0.79       0.80       0.81    0.81
10      3,302,222    0.79       0.80       0.81    0.81
25      7,805,252    0.79       0.80       0.81    0.81
50      15,310,302   0.79       0.81       0.81    0.81
100     30,320,402   0.79       0.81       -       0.81
250     75,350,702   0.79       0.81       -       0.81

Training time (sec) on pair-wise cross
validation experiments (l=10, N=2^16)
(Ratio: the number of non-interacting pairs divided by the number of
interacting pairs; Number: total number of pairs)

Ratio   Number       MH-L1SVM   MH-L2SVM   L1SVM        L2SVM
1       600,404      29         28         188          387
5       1,801,212    172        38         1,655        963
10      3,302,222    448        2661       1,261        10,798
25      7,805,252    1,808      732        20,067       4,623
50      15,310,302   1,140      811        58,045       8,936
100     30,320,402   7,601      1,643      > 24 hours   16,608
250     75,350,702   25,060     4,631      > 24 hours   43,843

AUC score of MH-L2SVM by varying the
length l of hashed strings

Training time of MH-L2SVM for varying
the length of string

Maximum AUC score
