DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
Advanced Circuits, Architecture, and Computing Lab
BgN-Score & BsN-Score: Bagging & Boosting
Based Ensemble Neural Networks Scoring
Functions for Accurate Binding Affinity
Prediction of Protein-Ligand Complexes
Hossam M. Ashtawy
ashtawy@egr.msu.edu
The 9th IAPR conference on Pattern Recognition in Bioinformatics
(PRIB 2014)
August 21, 2014
Nihar R. Mahapatra
nrm@egr.msu.edu
Department of Electrical & Computer Engineering
Michigan State University, East Lansing, MI, U.S.A.
© 2014
Motivation
2
BA is the principal determinant of
many vital biological processes
Accurate prediction of BA is
challenging & remains an unresolved problem
Conventional SFs have limited
predictive power
Outline
3
Background & Scope of Work
•Scoring Functions
•Our Approach
Materials & Methods
•Compound Characterization
•Ensemble Neural Networks
Experiments & Discussion
•SFs’ Tuning, Training, & Testing
•SFs’ Evaluation & Comparison
Background and Scope of Work
4
Docking & Scoring
5
[Diagram: ensemble NNs used as the scoring function (SF)]
Scoring Challenges
6
Lack of accurate accounting of
intermolecular physicochemical
interactions
More descriptors = Curse of
dimensionality
Relationship between descriptors & BA
could be highly nonlinear
Multi-Layer Neural Network
7
Theoretically, it can model any nonlinear continuous function; however:
Hard to tune its weights to optimal values
Does not handle high-dimensional data well
Has high variance errors
Is a black-box model, so it lacks descriptive power
Our Approach & Scope of Work
8
Collect large number of PLCs with
known BA
Extract a diverse set of features
Train ensemble NN models based
on boosting and bagging
Evaluate resulting SFs on diverse
and homogeneous protein families
Materials and Methods
9
Compound Database: PDBbind [3]
10
 Protein-ligand complexes obtained from
PDBbind 2007
 PDBbind is a selective compilation of the
Protein Data Bank (PDB) database
Compound Database: PDBbind
11
[Flow diagram: experimental protein-ligand complexes from the PDB (keys such as 1a30, 1d5j, 9hvp, 1f9g, 2qtg, 2usn) pass through PDBbind filters and binding-affinity collection; features are then calculated using the X-Score, AffiScore, RF-Score, and GOLD tools.]
PDBbind: Refined Set
12
Filters applied to form the refined set of PDBbind:
•Ligand's MW ≤ 1000
•# non-hydrogen atoms of the ligand ≥ 6
•Only one ligand bound to the protein
•Protein & ligand non-covalently bound
•Resolution of the complex crystal structure ≤ 2.5 Å
•Elements in complex must be C, N, O, P, S, F, Cl, Br, I, H
•Known Kd or Ki
•Hydrogenation, protonation & deprotonation
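As a sketch, the filter criteria above can be expressed as a single predicate. The record field names here are hypothetical, not PDBbind's actual schema:

```python
# Allowed chemical elements per the slide's refined-set criteria.
ALLOWED_ELEMENTS = {"C", "N", "O", "P", "S", "F", "Cl", "Br", "I", "H"}

def passes_refined_filters(c):
    """Return True if a complex (a dict with hypothetical field names)
    satisfies the refined-set filter criteria listed on the slide."""
    return (
        c["ligand_mw"] <= 1000                      # ligand MW cap
        and c["ligand_heavy_atoms"] >= 6            # >= 6 non-hydrogen atoms
        and c["n_ligands"] == 1                     # only one bound ligand
        and not c["covalent"]                       # non-covalently bound
        and c["resolution"] <= 2.5                  # crystal resolution in angstroms
        and set(c["elements"]) <= ALLOWED_ELEMENTS  # element whitelist
        and c["affinity_type"] in {"Kd", "Ki"}      # known Kd or Ki
    )
```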
PDBbind: Core Set
13
[Flow diagram: the refined set undergoes a BLAST similarity search with a 90% similarity cutoff; clusters with ≥ 4 complexes, in which the binding affinity of the highest-affinity complex is 100-fold that of the lowest, are retained; the highest-, middle-, and lowest-affinity complexes from each cluster form the core set of PDBbind.]
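The cluster-filtering step can be sketched as below. Affinities are treated as numeric values whose max/min ratio must reach 100-fold, and the three picks per cluster are interpreted as the highest-, median-, and lowest-affinity members (an interpretation of the flowchart, not a verbatim transcription):

```python
def pick_core(clusters):
    """Sketch of the core-set selection: keep a cluster only if it has
    >= 4 complexes and its affinities span at least 100-fold, then take
    the lowest-, median-, and highest-affinity complexes.

    Each cluster is a list of (pdb_id, affinity) pairs (hypothetical format).
    """
    core = []
    for members in clusters:
        if len(members) < 4:                      # cluster too small
            continue
        members = sorted(members, key=lambda m: m[1])
        if members[-1][1] < 100 * members[0][1]:  # affinity spread < 100-fold
            continue
        core += [members[0], members[len(members) // 2], members[-1]]
    return core
```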
Compound Characterization
14
Extracted features calculated for the following scoring functions:
X-Score (6 features)
AffiScore (30 features)
RF-Score (36 features)
GOLD (14 features)
[Flow diagram as on slide 11: complexes from the PDB pass through PDBbind filters, binding-affinity collection, and feature calculation to yield the training and test datasets.]
Training and Test Datasets
15
Primary training set (1105 complexes): Pr
Core test set (195 complexes): Cr
[Flow diagram: features calculated with the X-Score, AffiScore, RF-Score, and GOLD tools form the X, A, R, G, XA, ..., XARG datasets for both the primary training dataset Pr and the core test dataset Cr; ensemble NN boosting & bagging algorithms are trained on Pr and used to score test complexes from Cr.]
Base Learner: A Neural Network
16
Prediction of each network is calculated by propagating the features of a complex, x1, ..., xP, through the input, hidden, and output layers with weights wi,h and wh,o (plus bias units +1)
Network weights are optimized to minimize the fitting criterion E
[Diagram: fully connected feed-forward network mapping the features of a complex to a binding-affinity output.]
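The prediction and fitting-criterion formulas on this slide appear only as images in the deck. A standard formulation consistent with the slide's weight labels $w_{i,h}$ and $w_{h,o}$ (a sketch, not necessarily the exact form used) is:

```latex
\hat{y}(\mathbf{x}) \;=\; \sum_{h} w_{h,o}\,
    \sigma\!\Big(\sum_{i=1}^{P} w_{i,h}\, x_i + b_h\Big) + b_o,
\qquad
E \;=\; \sum_{j=1}^{n} \big(y_j - \hat{y}(\mathbf{x}_j)\big)^2
```

where $\sigma$ is the hidden-layer activation and $b_h$, $b_o$ are the bias units drawn as "+1" in the diagram. Given the regularization parameter λ on the tuning slides, E plausibly also includes a weight-decay term such as $\lambda \sum w^2$.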
BgN-Score
17
An ensemble of MLP ANNs is grown
Inputs to each ANN are a random subset of p features
Each ANN is trained on a bootstrap dataset randomly sampled with replacement from the training data
After building the ensemble model, the BA of a new protein-ligand complex X is computed by averaging the predictions of all networks in the ensemble
[Diagram: several networks, each fed a different random feature subset (x3, x21, x13; x8, x51, x6; x6, x2, x37), with outputs combined by an Average node into the predicted binding affinity.]
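The BgN-Score procedure above can be sketched in a few lines. A least-squares fit stands in for the MLP base learner (an assumption to keep the example short), but the bagging mechanics follow the slide: bootstrap rows, a random feature subset of size p, averaged predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    # Least-squares base learner; a stand-in for the MLP on the slide.
    w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
    return w

def predict_linear(w, X):
    return np.c_[X, np.ones(len(X))] @ w

def bag_train(X, y, n_models=50, p=None):
    """Bagged ensemble: each member sees a bootstrap sample of the
    complexes and a random subset of p features."""
    n, P = X.shape
    p = p or max(1, P // 3)                      # slide's rule of thumb: p ~ P/3
    ensemble = []
    for _ in range(n_models):
        rows = rng.integers(0, n, size=n)        # sample rows with replacement
        cols = rng.choice(P, size=p, replace=False)
        w = fit_linear(X[np.ix_(rows, cols)], y[rows])
        ensemble.append((cols, w))
    return ensemble

def bag_predict(ensemble, X):
    # Final binding-affinity estimate = average of member predictions.
    return np.mean([predict_linear(w, X[:, cols]) for cols, w in ensemble],
                   axis=0)
```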
BsN-Score
18
[Diagram: a boosted ensemble of networks, each fed features of a complex (x30, x19, x64; x39, x5, x8; x11, x2, x57), whose outputs are combined into the predicted binding affinity.]
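A minimal sketch of the boosting idea behind BsN-Score, again with a least-squares weak learner standing in for the neural network: each stage fits the residuals of the ensemble built so far, and its output is added with a shrinkage factor. The exact boosting variant used in the paper is not shown on this slide, so treat this as illustrative:

```python
import numpy as np

def boost_train(X, y, n_stages=50, lr=0.1):
    """Stage t fits a weak learner to the residuals left by stages
    1..t-1; stage outputs are accumulated with shrinkage lr."""
    Xb = np.c_[X, np.ones(len(X))]               # add intercept column
    pred = np.full(len(X), y.mean())             # start from the mean BA
    weights = []
    for _ in range(n_stages):
        w, *_ = np.linalg.lstsq(Xb, y - pred, rcond=None)  # fit residuals
        pred += lr * (Xb @ w)
        weights.append(w)
    return (y.mean(), lr, weights)

def boost_predict(model, X):
    base, lr, weights = model
    Xb = np.c_[X, np.ones(len(X))]
    pred = np.full(len(X), base)
    for w in weights:                            # sum shrunken stage outputs
        pred += lr * (Xb @ w)
    return pred
```

Bagging averages independently trained members to cut variance; boosting builds members sequentially so that each corrects the errors of the ensemble so far.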
Conventional SFs
19
Empirical
SFs (9)
DS::PLP
DS::Jain
DS::Ludi
GLIDE::GlideScore
SYBYL::ChemScore
SYBYL::F-Score
GOLD::ChemScore
GOLD::ASP
X-Score
Knowledge-Based SFs (4)
DS::LigScore
DS::PMF
SYBYL::PMF
DrugScore
Force-field
SFs (3)
SYBYL::D-Score
SYBYL::G-Score
GOLD::GoldScore
Experiments, Results, and
Discussion
20
SF Construction & Application Workflow
21
[Workflow diagram. Scoring function building and evaluation: collecting data → feature generation → training set formation → parameter tuning (a build/validate cycle on the training data yielding optimal parameters) → model building (BsN-Score & BgN-Score). Application: the protein 3D structure and ligand undergo feature generation, and the model outputs a binding affinity.]
Parameter Tuning: BgN-Score
22
Optimal parameters: H ≈ 20, p ≈ P/3, λ ≈ 0.001, N ≈ 2000
For each generated parameter set (e.g., H = 23, p = 15, λ = 0.031), build a BgN-Score model (training networks 1 through N = 2000) and test it on out-of-bag (OOB) examples; choose the parameter set that yields the minimum OOB MSE (sample OOB MSE values: 1.56, 1.04, 3.17).
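Out-of-bag evaluation, used here to score each parameter set, can be sketched as follows (a least-squares stand-in for the networks; each example is predicted only by the ensemble members whose bootstrap samples excluded it):

```python
import numpy as np

rng = np.random.default_rng(2)

def oob_mse(X, y, n_models=100):
    """OOB MSE estimate for a bagged ensemble of least-squares learners."""
    n = len(X)
    Xb = np.c_[X, np.ones(n)]
    oob_sum = np.zeros(n)                 # summed OOB predictions per example
    oob_cnt = np.zeros(n)                 # how many members left it out
    for _ in range(n_models):
        rows = rng.integers(0, n, size=n)          # bootstrap sample
        mask = np.ones(n, dtype=bool)
        mask[rows] = False                         # out-of-bag examples
        w, *_ = np.linalg.lstsq(Xb[rows], y[rows], rcond=None)
        oob_sum[mask] += Xb[mask] @ w
        oob_cnt[mask] += 1
    seen = oob_cnt > 0
    return float(np.mean((y[seen] - oob_sum[seen] / oob_cnt[seen]) ** 2))
```

Parameter tuning then reduces to computing `oob_mse` once per candidate parameter set and keeping the minimizer, as on the slide.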
Parameter Tuning: BsN-Score
23
Optimal parameters: H ≈ 20, p ≈ P/3, λ ≈ 0.001, N ≈ 2000
For each generated parameter set (e.g., H = 23, p = 15, λ = 0.031), build 10 BsN-Score models and test each on its respective validation examples; choose the parameter set that yields the minimum average cross-validation (CV) MSE (sample values: 1.56, 1.04, 3.17).
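The CV-based selection can be sketched with k-fold cross-validation of a regularized learner. Ridge regression stands in for BsN-Score here, with its penalty playing the role of the slide's λ:

```python
import numpy as np

rng = np.random.default_rng(3)

def cv_mse(X, y, lam, k=10):
    """Average k-fold cross-validation MSE for a ridge learner."""
    n, P = X.shape
    folds = np.array_split(rng.permutation(n), k)
    errs = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        # Closed-form ridge solution on the training folds.
        A = X[trn].T @ X[trn] + lam * np.eye(P)
        w = np.linalg.solve(A, X[trn].T @ y[trn])
        errs.append(np.mean((y[val] - X[val] @ w) ** 2))
    return float(np.mean(errs))

# Selection mirrors the slide: evaluate each candidate λ and keep the
# one with minimum CV MSE, e.g.
# best_lam = min([0.001, 0.1, 10.0], key=lambda l: cv_mse(X, y, l))
```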
Ensemble NN vs. Conventional SFs on Diverse Complexes
24
[Bar chart: Pearson correlation coefficient Rp (0 to 1) for 19 scoring functions on the diverse core test set: SYBYL::F-Score, SYBYL::PMF-Score, GOLD::GoldScore, DS::Jain, SYBYL::D-Score, GOLD::ChemScore, DS::PMF, GlideScore-XP, DS::LigScore2, DS::LUDI3, SYBYL::G-Score, GOLD::ASP, DS::PLP1, SYBYL::ChemScore, DrugScoreCSD, X-Score::HMScore (0.644), SNN-Score::X (0.657), BgN-Score::XARG (0.804), and BsN-Score::XARG (0.816).]
Ensemble NN vs. Conventional SFs on HIV Complexes
25
[Two bar charts of correlation coefficient Rp across scoring functions: left, disjoint training and test complexes; right, overlapping training and test complexes.]
Ensemble NN vs. Conventional SFs on Trypsin Complexes
26
[Two bar charts of correlation coefficient Rp across scoring functions: left, disjoint training and test complexes; right, overlapping training and test complexes.]
Ensemble NN vs. Conventional SFs on Carbonic Anhydrase Cmpxs.
27
[Two bar charts of correlation coefficient Rp across scoring functions: left, disjoint training and test complexes; right, overlapping training and test complexes.]
Ensemble NN vs. Conventional SFs on Thrombin Complexes
28
[Two bar charts of correlation coefficient Rp across scoring functions: left, disjoint training and test complexes; right, overlapping training and test complexes.]
Conclusion
29
Concluding Remarks
30
BsN-Score & BgN-Score are the most accurate SFs
BsN-Score & BgN-Score are ~20% more accurate (0.804 & 0.816 vs. 0.675) than SNN-Score
BsN-Score & BgN-Score are ~25% more accurate (0.804 & 0.816 vs. 0.644) than the best conventional SF, X-Score
Moreover, their accuracies are even higher when they are used to predict BAs of protein-ligand complexes related to their training sets
Future Work
31
Collect more PLCs from other databases
Consider other techniques to extract more descriptors
Analyze variable importance and descriptor interactions
Consider other types & topologies of ANNs, such as recurrent NNs and deep NNs
Thank You!
32

More Related Content

Viewers also liked

Poggi analytics - ensamble - 1b
Poggi   analytics - ensamble - 1bPoggi   analytics - ensamble - 1b
Poggi analytics - ensamble - 1bGaston Liberman
 
New ensemble methods for evolving data streams
New ensemble methods for evolving data streamsNew ensemble methods for evolving data streams
New ensemble methods for evolving data streamsAlbert Bifet
 
Machine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers EnsemblesMachine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers EnsemblesPier Luca Lanzi
 
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting MachinesDecision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting MachinesDeepak George
 
DMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesDMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesPier Luca Lanzi
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerRapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerDataminingTools Inc
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and predictionDataminingTools Inc
 

Viewers also liked (8)

Poggi analytics - ensamble - 1b
Poggi   analytics - ensamble - 1bPoggi   analytics - ensamble - 1b
Poggi analytics - ensamble - 1b
 
Tree advanced
Tree advancedTree advanced
Tree advanced
 
New ensemble methods for evolving data streams
New ensemble methods for evolving data streamsNew ensemble methods for evolving data streams
New ensemble methods for evolving data streams
 
Machine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers EnsemblesMachine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers Ensembles
 
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting MachinesDecision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
 
DMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesDMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification Ensembles
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerRapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid Miner
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 

Similar to Prib2014

Automation of building reliable models
Automation of building reliable modelsAutomation of building reliable models
Automation of building reliable modelsEszter Szabó
 
Translating data to predictive models
Translating data to predictive modelsTranslating data to predictive models
Translating data to predictive modelsChemAxon
 
Promise 2011: "Customization Support for CBR-Based Defect Prediction"
Promise 2011: "Customization Support for CBR-Based Defect Prediction"Promise 2011: "Customization Support for CBR-Based Defect Prediction"
Promise 2011: "Customization Support for CBR-Based Defect Prediction"CS, NcState
 
Translating data to model ICCS2022_pub.pdf
Translating data to model ICCS2022_pub.pdfTranslating data to model ICCS2022_pub.pdf
Translating data to model ICCS2022_pub.pdfwhitecomma
 
Effective data pre-processing for AutoML
Effective data pre-processing for AutoMLEffective data pre-processing for AutoML
Effective data pre-processing for AutoMLJoseph Giovanelli
 
Probabilistic Collaborative Filtering with Negative Cross Entropy
Probabilistic Collaborative Filtering with Negative Cross EntropyProbabilistic Collaborative Filtering with Negative Cross Entropy
Probabilistic Collaborative Filtering with Negative Cross EntropyAlejandro Bellogin
 
A scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clusteringA scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clusteringAllenWu
 
ANN in System Biology
ANN in System Biology ANN in System Biology
ANN in System Biology Hajra Qayyum
 
ES-Rank: Evolutionary Strategy Learning to Rank Approach (Presentation)
ES-Rank: Evolutionary Strategy Learning to Rank Approach (Presentation)ES-Rank: Evolutionary Strategy Learning to Rank Approach (Presentation)
ES-Rank: Evolutionary Strategy Learning to Rank Approach (Presentation)Minia University, Egypt
 
Model Selection Using Conformal Predictors
Model Selection Using Conformal PredictorsModel Selection Using Conformal Predictors
Model Selection Using Conformal PredictorsAbhay Gupta
 
Prediction Of Bioactivity From Chemical Structure
Prediction Of Bioactivity From Chemical StructurePrediction Of Bioactivity From Chemical Structure
Prediction Of Bioactivity From Chemical StructureJeremy Besnard
 
RBHF_SDM_2011_Jie
RBHF_SDM_2011_JieRBHF_SDM_2011_Jie
RBHF_SDM_2011_JieMDO_Lab
 
Data Mining Protein Structures' Topological Properties to Enhance Contact Ma...
Data Mining Protein Structures' Topological Properties  to Enhance Contact Ma...Data Mining Protein Structures' Topological Properties  to Enhance Contact Ma...
Data Mining Protein Structures' Topological Properties to Enhance Contact Ma...jaumebp
 
IDA 2015: Efficient model selection for regularized classification by exploit...
IDA 2015: Efficient model selection for regularized classification by exploit...IDA 2015: Efficient model selection for regularized classification by exploit...
IDA 2015: Efficient model selection for regularized classification by exploit...George Balikas
 
Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016Matthew Clark
 
Thesis presentation: Applications of machine learning in predicting supply risks
Thesis presentation: Applications of machine learning in predicting supply risksThesis presentation: Applications of machine learning in predicting supply risks
Thesis presentation: Applications of machine learning in predicting supply risksTuanNguyen1697
 
[IUI 2017] Criteria Chains: A Novel Multi-Criteria Recommendation Approach
[IUI 2017] Criteria Chains: A Novel Multi-Criteria Recommendation Approach[IUI 2017] Criteria Chains: A Novel Multi-Criteria Recommendation Approach
[IUI 2017] Criteria Chains: A Novel Multi-Criteria Recommendation ApproachYONG ZHENG
 
APPLICATION OF STATISTICAL LEARNING TECHNIQUES AS PREDICTIVE TOOLS FOR MACHIN...
APPLICATION OF STATISTICAL LEARNING TECHNIQUES AS PREDICTIVE TOOLS FOR MACHIN...APPLICATION OF STATISTICAL LEARNING TECHNIQUES AS PREDICTIVE TOOLS FOR MACHIN...
APPLICATION OF STATISTICAL LEARNING TECHNIQUES AS PREDICTIVE TOOLS FOR MACHIN...Shibaprasad Bhattacharya
 

Similar to Prib2014 (20)

Automation of building reliable models
Automation of building reliable modelsAutomation of building reliable models
Automation of building reliable models
 
Translating data to predictive models
Translating data to predictive modelsTranslating data to predictive models
Translating data to predictive models
 
Promise 2011: "Customization Support for CBR-Based Defect Prediction"
Promise 2011: "Customization Support for CBR-Based Defect Prediction"Promise 2011: "Customization Support for CBR-Based Defect Prediction"
Promise 2011: "Customization Support for CBR-Based Defect Prediction"
 
Translating data to model ICCS2022_pub.pdf
Translating data to model ICCS2022_pub.pdfTranslating data to model ICCS2022_pub.pdf
Translating data to model ICCS2022_pub.pdf
 
Effective data pre-processing for AutoML
Effective data pre-processing for AutoMLEffective data pre-processing for AutoML
Effective data pre-processing for AutoML
 
ds2010
ds2010ds2010
ds2010
 
Probabilistic Collaborative Filtering with Negative Cross Entropy
Probabilistic Collaborative Filtering with Negative Cross EntropyProbabilistic Collaborative Filtering with Negative Cross Entropy
Probabilistic Collaborative Filtering with Negative Cross Entropy
 
A scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clusteringA scalable collaborative filtering framework based on co clustering
A scalable collaborative filtering framework based on co clustering
 
ANN in System Biology
ANN in System Biology ANN in System Biology
ANN in System Biology
 
ES-Rank: Evolutionary Strategy Learning to Rank Approach (Presentation)
ES-Rank: Evolutionary Strategy Learning to Rank Approach (Presentation)ES-Rank: Evolutionary Strategy Learning to Rank Approach (Presentation)
ES-Rank: Evolutionary Strategy Learning to Rank Approach (Presentation)
 
Model Selection Using Conformal Predictors
Model Selection Using Conformal PredictorsModel Selection Using Conformal Predictors
Model Selection Using Conformal Predictors
 
Prediction Of Bioactivity From Chemical Structure
Prediction Of Bioactivity From Chemical StructurePrediction Of Bioactivity From Chemical Structure
Prediction Of Bioactivity From Chemical Structure
 
RBHF_SDM_2011_Jie
RBHF_SDM_2011_JieRBHF_SDM_2011_Jie
RBHF_SDM_2011_Jie
 
Prediction of pKa from chemical structure using free and open source tools
Prediction of pKa from chemical structure using free and open source toolsPrediction of pKa from chemical structure using free and open source tools
Prediction of pKa from chemical structure using free and open source tools
 
Data Mining Protein Structures' Topological Properties to Enhance Contact Ma...
Data Mining Protein Structures' Topological Properties  to Enhance Contact Ma...Data Mining Protein Structures' Topological Properties  to Enhance Contact Ma...
Data Mining Protein Structures' Topological Properties to Enhance Contact Ma...
 
IDA 2015: Efficient model selection for regularized classification by exploit...
IDA 2015: Efficient model selection for regularized classification by exploit...IDA 2015: Efficient model selection for regularized classification by exploit...
IDA 2015: Efficient model selection for regularized classification by exploit...
 
Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016
 
Thesis presentation: Applications of machine learning in predicting supply risks
Thesis presentation: Applications of machine learning in predicting supply risksThesis presentation: Applications of machine learning in predicting supply risks
Thesis presentation: Applications of machine learning in predicting supply risks
 
[IUI 2017] Criteria Chains: A Novel Multi-Criteria Recommendation Approach
[IUI 2017] Criteria Chains: A Novel Multi-Criteria Recommendation Approach[IUI 2017] Criteria Chains: A Novel Multi-Criteria Recommendation Approach
[IUI 2017] Criteria Chains: A Novel Multi-Criteria Recommendation Approach
 
APPLICATION OF STATISTICAL LEARNING TECHNIQUES AS PREDICTIVE TOOLS FOR MACHIN...
APPLICATION OF STATISTICAL LEARNING TECHNIQUES AS PREDICTIVE TOOLS FOR MACHIN...APPLICATION OF STATISTICAL LEARNING TECHNIQUES AS PREDICTIVE TOOLS FOR MACHIN...
APPLICATION OF STATISTICAL LEARNING TECHNIQUES AS PREDICTIVE TOOLS FOR MACHIN...
 

Recently uploaded

Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityAggregage
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)Data & Analytics Magazin
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.JasonViviers2
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxDwiAyuSitiHartinah
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionajayrajaganeshkayala
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?sonikadigital1
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introductionsanjaymuralee1
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024Becky Burwell
 

Recently uploaded (17)

Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual intervention
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 

Prib2014

  • 1. DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING Advanced Circuits, Architecture, and Computing Lab BgN-Score & BsN-Score: Bagging & Boosting Based Ensemble Neural Networks Scoring Functions for Accurate Binding Affinity Prediction of Protein-Ligand Complexes Hossam M. Ashtawy ashtawy@egr.msu.edu The 9th IAPR conference on Pattern Recognition in Bioinformatics (PRIB 2014) August 21, 2014 Nihar R. Mahapatra nrm@egr.msu.edu Department of Electrical & Computer Engineering Michigan State University, East Lansing, MI, U.S.A. © 2014
  • 2. Motivation 2 BA is the principal determinant of many vital biological processes Accurate prediction of BA is challenging & remains unresolved prb. Conventional SFs have limited predictive power
  • 3. Outline 3 •Scoring Functions •Our Approach Background & Scope of Work •Compound Characterization •Ensemble Neural Networks Materials & Methods •SFs’ Tuning, Training, & Testing •SFs’ evaluation & comparison Experiments & Discussion
  • 6. Scoring Challenges 6 Lack of accurate accounting of intermolecular physicochemical interactions More descriptors = Curse of dimensionality Relationship between descriptors & BA could be highly nonlinear
  • 7. Multi-Layer Neural Network 7  Theoretically, it can model any nonlinear continuous function; however:  Hard to tune its weights to optimal values  Does not handle high dimensional data well  Has high variance errors  Is a black box model  lacks descriptive power
  • 8. Our Approach & Scope of Work 8 Collect large number of PLCs with known BA Extract a diverse set of features Train ensemble NN models based on boosting and bagging Evaluate resulting SFs on diverse and homogeneous protein families
  • 10. Compound Database: PDBbind [3] 10  Protein-ligand complexes obtained from PDBbind 2007  PDBbind is a selective compilation of the Protein Data Bank (PDB) database
  • 11. Compound Database: PDBbind 11 1a30 9hvp1d5j Protein-ligand complexes from PDB PDB key 1a30 1f9g 2qtg 2usn 1f9g 2qtg PDBbind filters and Binding Affinity Collection 2usn Feature Calculation (using X-Score, Affi-Score, RF-Score, and Gold tools) ExperimentalData
  • 12. PDB Ligand’s MW ≤ 1000 # non-hydrogen atoms of the ligand ≥ 6 Only one ligand is bound to the protein Protein & ligand non- covalently bound Resolution of the complex crystal structure ≤ 2.5Å Elements in complex must be C, N, O, P, S, F, Cl, Br, I, H Known Kd or Ki Hydrogenation Protonation & deprotonation Refined set of PDBbind PDBbind: Refined Set 12
  • 13. PDBbind: Core Set 13 Refined set Similarity search using BLAST Similarity cutoff of 90% Clusters with ≥ 4 complexes Binding affinity of highest- affinity complex is 100-fold the affinity of lowest one First, middle, and lowest affinity complexes from each cluster Core set of PDBbind
  • 14.  Extracted features calculated for the following scoring functions: X-Score (6 features) AffiScore (30 features). RF-Score (36 features) GOLD (14 features) Compound Characterization 14 1a30 9hvp1d5j Protein-ligand complexes from PDB PDB key 1a30 Training and Test Datasets 1f9g 2qtg 2usn 1f9g 2qtg PDBbind filters and Binding Affinity Collection 2usn Feature Calculation (using X-Score, Affi-Score, RF-Score, and Gold tools) ExperimentalData
  • 15.  Primary training set (1105): Pr  Core test set (195): Cr Training and Test Datasets 15 1a30 Training and Test Datasets 1f9g 2qtg 2usn Primary Training Dataset Pr Core Test Dataset Cr Feature Calculation (using X-Score, Affi-Score, RF-Score, and Gold tools) Ensemble NN boosting & bagging Algorithms Test Complex to Score ExperimentalDat X Dataset A Dataset R Dataset G Dataset XA Dataset ...XARG Dataset X Dataset A Dataset R Dataset G Dataset XA Dataset ...XARG Dataset
  • 16. Base Learner: A Neural Network 16  Prediction of each network is calculated as follows:  Network weights are optimized to minimize the fitting criterion E: Input layer Hidden layer Output layer wh,owi,h +1 x1 x2 xP Bindingaffinity +1 Featuresofacomplex
  • 17. BgN-Score 17  An ensemble of MLP ANNs grown  Inputs to each ANN are a random subset of p features  Each ANN trained on a bootstrap dataset randomly sampled with replacement from training data  After building the ensemble model, the BA of a new protein-ligand complex X is computed by applying the formula: wh,owi,h +1 x3 x21 x13 Bindingaffinity +1 wh,owi,h +1 x8 x51 x6 +1 wh,owi,h +1 x6 x2 x37 +1 Featuresofacomplex Average
  • 19. Conventional SFs (16 total):
    - Empirical SFs (9): DS::PLP, DS::Jain, DS::Ludi, GLIDE::GlideScore, SYBYL::ChemScore, SYBYL::F-Score, GOLD::ChemScore, GOLD::ASP, X-Score
    - Knowledge-based SFs (4): DS::LigScore, DS::PMF, SYBYL::PMF, DrugScore
    - Force-field SFs (3): SYBYL::D-Score, SYBYL::G-Score, GOLD::GoldScore
  • 21. SF Construction & Application Workflow. Data collection (protein 3D structure, ligand, and binding affinity) is followed by feature generation and training set formation. During model building, BsN-Score & BgN-Score are built and validated on the training data to tune their parameters; the models with the optimal parameters are then applied, after feature generation, to predict the binding affinity of new complexes.
  • 22. Parameter Tuning: BgN-Score. For each generated parameter set (e.g., H = 23, p = 15, λ = 0.031), a BgN-Score model is built while training its networks (Net. 1 through Net. 2000) and tested on out-of-bag (OOB) examples; the parameter set corresponding to the minimum OOB MSE is chosen. Optimal parameters: H ≈ 20, p ≈ P/3, λ ≈ 0.001, N ≈ 2000.
  • 23. Parameter Tuning: BsN-Score. For each generated parameter set (e.g., H = 23, p = 15, λ = 0.031), 10 BsN-Score models are built and each is tested on its respective validation fold; the parameter set corresponding to the minimum average cross-validation (CV) MSE is chosen. Optimal parameters: H ≈ 20, p ≈ P/3, λ ≈ 0.001, N ≈ 2000.
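The tuning loops on these two slides follow the same pattern: score every candidate parameter set with a validation-error estimate and keep the minimizer. A generic sketch, where the hypothetical `evaluate_mse` callable stands in for either the OOB estimate (BgN-Score) or the 10-fold CV average (BsN-Score):

```python
def tune_parameters(param_sets, evaluate_mse):
    """Return the parameter set with the minimum validation MSE, plus that MSE."""
    best, best_mse = None, float("inf")
    for params in param_sets:
        mse = evaluate_mse(params)  # OOB MSE or average CV MSE for this candidate
        if mse < best_mse:
            best, best_mse = params, mse
    return best, best_mse
```

For example, a grid over H, p, and λ would be passed in as `param_sets`, and the returned minimizer corresponds to the slides' H ≈ 20, p ≈ P/3, λ ≈ 0.001 result.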
  • 24. Ensemble NN vs. Conventional SFs on Diverse Complexes. [Bar chart of Pearson correlation coefficient Rp per scoring function, in ascending order: SYBYL::F-Score, SYBYL::PMF-Score, GOLD::GoldScore, DS::Jain, SYBYL::D-Score, GOLD::ChemScore, DS::PMF, GlideScore-XP, DS::LigScore2, DS::LUDI3, SYBYL::G-Score, GOLD::ASP, DS::PLP1, SYBYL::ChemScore, DrugScoreCSD, X-Score::HMScore (0.644), SNN-Score::X (0.657), BgN-Score::XARG (0.804), BsN-Score::XARG (0.816).]
  • 25. Ensemble NN vs. Conventional SFs on HIV Complexes. [Two bar charts of correlation coefficient Rp per scoring function: left, disjoint training and test complexes; right, overlapping training and test complexes.]
  • 26. Ensemble NN vs. Conventional SFs on Trypsin Complexes. [Two bar charts of correlation coefficient Rp per scoring function: left, disjoint training and test complexes; right, overlapping training and test complexes.]
  • 27. Ensemble NN vs. Conventional SFs on Carbonic Anhydrase Complexes. [Two bar charts of correlation coefficient Rp per scoring function: left, disjoint training and test complexes; right, overlapping training and test complexes.]
  • 28. Ensemble NN vs. Conventional SFs on Thrombin Complexes. [Two bar charts of correlation coefficient Rp per scoring function: left, disjoint training and test complexes; right, overlapping training and test complexes.]
  • 30. Concluding Remarks:
    - BsN-Score & BgN-Score are the most accurate SFs evaluated
    - They are ~20% more accurate than SNN-Score (Rp = 0.804 & 0.816 vs. 0.675)
    - They are ~25% more accurate than the best conventional SF, X-Score (Rp = 0.804 & 0.816 vs. 0.644)
    - Their accuracies are even higher when they predict BAs of protein-ligand complexes related to their training sets
  • 31. Future Work:
    - Collect more PLCs from other databases
    - Consider other techniques to extract more descriptors
    - Analyze variable importance and descriptor interactions
    - Consider other types & topologies of ANNs, such as recurrent NNs and deep NNs

Editor's Notes

  1. We used the same complex database that Cheng et al. used as a benchmark in their comparative assessment of 16 popular SFs. This database, PDBbind, is a popular benchmark that has been cited and used to evaluate SFs in hundreds of other studies (per Google Scholar). PDBbind is a high-quality and comprehensive compilation of biomolecular complexes deposited in the Protein Data Bank (PDB).
  2. [The slide itself has the talk.]
  3. Boosting is an ensemble machine-learning technique based on stage-wise fitting of base learners. The technique attempts to minimize the overall loss by boosting the complexes with the highest prediction errors, i.e., by fitting NNs to the (accumulated) residuals left by previous networks in the ensemble model. The algorithm starts by fitting the first network to all training complexes. A small fraction (ν < 1) of the first network’s predictions is used to calculate the first iteration of residuals Y1, and the network f1 becomes the first term in the boosting additive model. In each subsequent stage, a network is trained on a bootstrap sample of the training complexes described by a random subset of p < P features. The dependent-variable values for network l are the current residuals of the sampled protein-ligand complexes, and the residuals after each network are the differences between the previous residuals and a small fraction of that network's predictions. This fraction is controlled by the shrinkage parameter ν < 1 to avoid overfitting. Network generation continues as long as the number of networks does not exceed a predefined limit L. Each network joins the ensemble as a shrunk version of itself. In our experiments, we fixed the shrinkage parameter to 0.001, which gave the lowest out-of-sample error. The final prediction for a PLC x is: [read the formula given in the slide]
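The stage-wise procedure in this note can be sketched as follows. This is a gradient-boosting-style loop in which each network is fit to the current residuals and joins the model shrunk by ν; the `fit_network` callable is a stand-in for training one MLP, and the per-stage bootstrap/feature subsampling described in the note is omitted for brevity.

```python
import numpy as np

def bsn_train(X, y, fit_network, n_stages=100, shrinkage=0.001):
    """Boosted ensemble: each stage fits a network to the current residuals,
    then the residuals are reduced by a shrunk fraction of its predictions."""
    residuals = y.astype(float).copy()  # stage 0: residuals are the raw BAs
    stages = []
    for _ in range(n_stages):
        net = fit_network(X, residuals)
        preds = np.array([net(x) for x in X])
        residuals = residuals - shrinkage * preds  # subtract nu * f_l(x)
        stages.append(net)
    return stages

def bsn_predict(stages, x, shrinkage=0.001):
    """Final prediction: sum of shrunk stage predictions for a new complex x."""
    return float(shrinkage * sum(net(x) for net in stages))
```

With a small ν such as the 0.001 used in the experiments, many stages (the limit L) are needed before the residuals shrink appreciably, which is the usual bias-variance trade-off of heavily shrunk boosting.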
  4. A total of sixteen popular SFs are compared to NN SFs in this study. The sixteen SFs are either used in mainstream commercial docking tools and/or have been developed in academia. The functions were recently compared against each other in a study conducted by Cheng et al. The set includes 9 Empirical SFs, 4 Knowledge-based SFs, and 3 Force-field SFs.