The 24th International Conference on Genome
Informatics (GIW2013), Dec. 16 2013.

Scalable prediction of compound-protein interactions using minwise
hashing
Yasuo Tabei (PRESTO, JST)
Joint work with
Yoshihiro Yamanishi (Kyushu Univ.)
Drug target interactions
• Most drugs are small molecules that interact with
one or several target proteins
• Analyzing functional interactions between small
compounds and proteins plays an important role
in genomic drug discovery
Genome-wide prediction of unknown
compound-protein interactions

• Yamanishi et al., Bioinformatics (ISMB2008), 24:i232-i240, 2008.
• Faulon et al., Bioinformatics, 24:225-233, 2008.
• Jacob et al., Bioinformatics, 24:2149-2156, 2008.
• Bleakley et al., Bioinformatics, 25:2397-2403, 2009.
Fingerprints (binary vectors) of
compound and protein

• Compounds represented by PubChem substructures

• Proteins represented by PFAM domains
4,137 elements
Fingerprint representation of
compound-protein pairs
• Tensor product of each compound and protein pair

– All possible products of compound substructures and
PFAM domains

• Observation: the fingerprint representation yields
a large number of high-dimensional fingerprints:
– Number: 216 million (= 35,366 × 6,111)
– Dimension: 771,756 (= 881 × 876)
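
As a rough illustration of the tensor-product representation, the sketch below stores each binary fingerprint as the set of its active indices and forms the pair fingerprint from all (substructure, domain) index combinations. The function name `pair_fingerprint` and the flat index `c * n_protein_dims + p` are assumptions made for illustration and are not taken from the authors' implementation.

```python
from itertools import product

N_COMPOUND_DIMS = 881  # PubChem substructures (from the slide)
N_PROTEIN_DIMS = 876   # PFAM domains (from the slide)


def pair_fingerprint(compound_bits, protein_bits, n_protein_dims=N_PROTEIN_DIMS):
    """Return the active indices of the tensor-product fingerprint.

    Each (substructure c, domain p) combination that is active in both
    inputs becomes one active position c * n_protein_dims + p.
    """
    return {c * n_protein_dims + p for c, p in product(compound_bits, protein_bits)}


compound = {0, 5, 880}   # hypothetical active substructure indices
protein = {3, 875}       # hypothetical active domain indices
fp = pair_fingerprint(compound, protein)

print(len(fp))                           # active bits = |compound| * |protein|
print(N_COMPOUND_DIMS * N_PROTEIN_DIMS)  # dimension of the pair fingerprint
```

Storing only the active indices keeps each pair fingerprint small, but the number of pairs (216 million) still makes direct training expensive, which motivates the compact fingerprints introduced next.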
Existing methods
• Pairwise Kernel SVM [Faulon et al.,2008]
– Kernel matrix of inner products between each pair of
fingerprints of compound-protein pair
Large time complexity:
(nc, np: the number of compounds/proteins)
Large working space:

• Linear SVM (e.g., LIBLINEAR (Lin et al., 2007))
– Uses fingerprints of compound-protein pairs as input
Large training time and working space

• Challenge: developing a scalable method for predicting
large-scale compound-protein interactions
Overview of our method
• Basic idea: build compact fingerprints from
fingerprints of compound-protein pairs
– Leverage an idea behind MinHash (Minwise Hashing)
[Broder et al., 2000]

• Train linear classifiers using compact fingerprints
– Smaller working space for training
– Short training time
– The same classification accuracy as previous
methods
– Interpretability of features
MinHash [Broder et al., 2000]
• Mapping a set S into a string of length l
1. Generate a random permutation π_k
2. Apply the permutation π_k to the set S
3. Take the minimum of π_k(S) as the k-th integer
4. Iterate steps 1-3 for k = 1, ..., l

• Preserves the Jaccard similarity in the original
space
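
The MinHash steps can be sketched as follows. This is a minimal illustration using explicit random permutations over a small universe; practical implementations usually replace the permutations with hash functions, and all names here (`make_permutations`, `minhash`, `estimate_jaccard`) are hypothetical.

```python
import random


def make_permutations(universe_size, length, seed=0):
    """Generate `length` random permutations of {0, ..., universe_size - 1}."""
    rng = random.Random(seed)
    perms = []
    for _ in range(length):
        perm = list(range(universe_size))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def minhash(items, permutations):
    """Map a set of integers to the list of its minimum permuted values."""
    return [min(perm[x] for x in items) for perm in permutations]


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions at which two MinHash strings agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


perms = make_permutations(universe_size=100, length=500, seed=42)
s_i = set(range(0, 60))
s_j = set(range(30, 90))
sig_i = minhash(s_i, perms)
sig_j = minhash(s_j, perms)

print(jaccard(s_i, s_j))               # exact similarity: 30 shared / 90 total
print(estimate_jaccard(sig_i, sig_j))  # estimate from the two MinHash strings
```

For a single random permutation, the two minima coincide with probability equal to the Jaccard similarity, so the fraction of agreeing positions converges to it as the string length grows.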
Saving memory by additional hashing
• Drawback of MinHash: many bits are needed to
store each hashed value
• Reduce each hashed value to a smaller value
– Apply a random hash function h: {1,...,M} → {1,...,N}
(N << M) to each hashed value

• Collision probability is derived as follows:

• J(Si,Sj): Jaccard similarity
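
Assuming h is a uniformly random hash function chosen independently of the permutation π (an assumption made here for the derivation), the collision probability takes the form below. The two minima coincide with probability J(Si,Sj), in which case the reduced values always collide. Otherwise the minima differ and their reduced values collide by chance with probability 1/N.

```latex
\Pr\bigl[\, h(\min \pi(S_i)) = h(\min \pi(S_j)) \,\bigr]
  = J(S_i, S_j) + \bigl(1 - J(S_i, S_j)\bigr)\,\frac{1}{N}
```

For large N the second term is small, so the observed collision rate stays close to the Jaccard similarity.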
Collision probability for various Jaccard
similarities J and additional hashings N
Procedure for building compact
fingerprints
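
One plausible reading of the procedure, offered as a sketch rather than the authors' implementation, is: compute l MinHash values for a pair fingerprint, reduce each with an additional hash into {0, ..., N-1}, and expand the k-th reduced value into position k * N + value of a sparse binary vector of dimension l * N. Salted hashes built on Python's `hashlib.blake2b` stand in for random permutations here. `compact_fingerprint`, `_hash` and their parameters are hypothetical names.

```python
import hashlib


def _hash(value, salt, modulus):
    """Deterministic salted hash of an integer into {0, ..., modulus - 1}."""
    digest = hashlib.blake2b(f"{salt}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % modulus


def compact_fingerprint(active_indices, length=10, n_buckets=2**16):
    """Return the active positions of a binary vector of size length * n_buckets."""
    positions = []
    for k in range(length):
        # MinHash value under the k-th pseudo-permutation (a 64-bit salted hash).
        min_value = min(_hash(x, f"perm{k}", 2**64) for x in active_indices)
        # Additional hashing reduces the value to one of n_buckets possibilities.
        reduced = _hash(min_value, f"reduce{k}", n_buckets)
        # One-hot expansion: block k covers [k * n_buckets, (k + 1) * n_buckets).
        positions.append(k * n_buckets + reduced)
    return positions


fp_a = compact_fingerprint({1, 7, 42, 771755})
fp_b = compact_fingerprint({1, 7, 42, 500000})

print(len(fp_a))                   # exactly one active position per block
print(len(set(fp_a) & set(fp_b)))  # inner product = number of agreeing blocks
```

Under this encoding the inner product of two compact fingerprints counts the blocks in which the hashed values agree, which is why a linear model on these vectors can behave like a model based on the Jaccard similarity of the original pair fingerprints.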
SVM using compact fingerprints
• Use L1- and L2-regularizations to prevent
overfitting
• MH-L1SVM (L1-regularization)

• MH-L2SVM (L2-regularization)

• Use LIBLINEAR (Lin et al., 2007), an efficient
solver for linear classifiers
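
For orientation, the L1- and L2-regularized L2-loss SVM problems as documented for LIBLINEAR are shown below. That the talk used the L2-loss variants specifically is an assumption. Here x_i denotes a compact fingerprint, y_i ∈ {−1, +1} its interaction label, n the number of training pairs, and C > 0 the regularization parameter.

```latex
\begin{aligned}
\text{L1-regularized:}\quad
  & \min_{w}\; \|w\|_1 + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\, w^{\top} x_i\bigr)^2 \\
\text{L2-regularized:}\quad
  & \min_{w}\; \tfrac{1}{2}\, w^{\top} w + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\, w^{\top} x_i\bigr)^2
\end{aligned}
```

The L1 penalty tends to drive many weights to exactly zero, which is what makes the extraction of a small set of important features possible.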
Other details
• Linear SVM with compact fingerprints simulates
non-linear SVM with pairwise kernels
– A non-linear SVM can thus be approximated by a linear SVM

• Can extract important features for predicting
compound-protein interactions
– Use reverse hashing functions

• See our paper for more details
Experiments
• 216 million compound-protein pairs, including
300,202 interacting pairs
– Unbalanced data

• Use AUC score, training time and memory as
evaluation measures
• Compare MH-L1SVM and MH-L2SVM to L1SVM
and L2SVM
– L1- and L2-regularized SVM with fingerprints
computed by tensor products
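
Because the data are heavily unbalanced, plain accuracy would say little, which is presumably why AUC is used. As a minimal sketch (not the evaluation code used for the talk), AUC can be computed as the probability that a randomly chosen interacting pair receives a higher score than a randomly chosen non-interacting pair, with ties counted as one half. `auc_score` is a hypothetical helper name.

```python
def auc_score(labels, scores):
    """AUC via the rank-sum formulation, using average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the end of the group of samples sharing the same score.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


labels = [1, 0, 1, 0, 0]             # 1 = interacting, 0 = non-interacting
scores = [0.9, 0.8, 0.7, 0.3, 0.1]   # hypothetical classifier scores
print(auc_score(labels, scores))     # 5 of the 6 positive/negative pairs are ordered correctly
```

AUC depends only on the ranking of the scores, so it remains comparable across the different ratios of non-interacting to interacting pairs used in the experiments.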
Two types of 5-fold cross validation
AUC score of MH-L1SVM by varying the
length of hashed strings l
Balanced dataset of 600,404 compound-protein pairs
Training time of MH-L1SVM for varying
the length of string (N=2^16)
Balanced dataset of 600,404 compound-protein pairs

Maximum AUC score
Memory usage versus the number of compound-protein pairs (l=10, N=2^16)
AUC score and training time on 216 million
compound-protein pairs
(l=10, N=2^16)

Measure              MH-L1SVM  MH-L2SVM  L1SVM       L2SVM
AUC score            0.79      0.81      -           -
Training time (sec)  15,713    10,054    > 48 hours  > 48 hours
The number of extracted features
Summary
• Scalable prediction of compound-protein
interactions using minwise hashing
• Applicable to 216 million compound-protein pairs
• The same trends as in the pair-wise cross
validation experiments are observed in the
block-wise experiments (see our paper)
• Dataset and C++ implementation:
https://sites.google.com/site/interactminhash/
The number of extracted features
[Figure: number of features (L1SVM, L1LOG) versus ratio of negative samples (log scale, base 10)]
AUC score on pair-wise cross validation
experiment (l=10, N=2^16)
(Ratio of the number of non-interacting pairs to that of
interacting pairs)

Ratio  Number      MH-L1SVM  MH-L2SVM  L1SVM  L2SVM
1      600,404     0.78      0.79      0.79   0.80
5      1,801,212   0.79      0.80      0.81   0.81
10     3,302,222   0.79      0.80      0.81   0.81
25     7,805,252   0.79      0.80      0.81   0.81
50     15,310,302  0.79      0.81      0.81   0.81
100    30,320,402  0.79      0.81      -      0.81
250    75,350,702  0.79      0.81      -      0.81
Training time (sec) on pair-wise cross
validation experiments (l=10, N=2^16)
(Ratio of the number of non-interacting pairs to that of
interacting pairs)

Ratio  Number      MH-L1SVM  MH-L2SVM  L1SVM       L2SVM
1      600,404     29        28        188         387
5      1,801,212   172       38        1,655       963
10     3,302,222   448       2661      1,261       10,798
25     7,805,252   1,808     732       20,067      4,623
50     15,310,302  1,140     811       58,045      8,936
100    30,320,402  7,601     1,643     > 24 hours  16,608
250    75,350,702  25,060    4,631     > 24 hours  43,843
AUC score of MH-L2SVM by varying the
length l of hashed strings
Balanced dataset of 600,404 compound-protein pairs
Training time of MH-L2SVM for varying
the length of string
Balanced dataset of 600,404 compound-protein pairs

Maximum AUC score
