
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models

Talk given to computational toxicology group at the EPA 23 Sept 2015 - Alex Clark co-author

Published in: Science


  1. 1. Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models Sean Ekins1,2,3* and Alex M. Clark4 1 Collaborations Pharmaceuticals, Inc., 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA 2 Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, CA 94010, USA 3 Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA 4 Molecular Materials Informatics, 1900 St. Jacques #302, Montreal H3J 2S1, Quebec, Canada Disclosure: As well as being an employee of the above, funded by NIH and EC FP7; consultant for several rare disease foundations, drug companies, consumer product companies, etc.
  2. 2. Laboratories past and present Lavoisier’s lab 18th C Edison’s lab 20th C Author’s lab 21st C + Network of global collaborators
  3. 3. "Rub al Khali 002" by Nepenthes The chemistry/biology data desert outside of pharma circa the early 2000s: limited ADME/Tox data; paucity of structure-activity data; small datasets for modeling. Drug companies were the gatekeepers of information for drug discovery
  4. 4. "Oasis in Libya" by Sfivat The growing chemistry/biology data Oasis outside of pharma circa 2015
  5. 5. ADME/Tox models 15 yrs on: Then & Now • Then: datasets very small (<100 cpds); heavy focus on P450; models rarely used; very limited number of properties addressed; few tools/algorithms used; limited access to models • Now: much bigger datasets (1000s to >10,000 cpds); broader range of models; models more widely used and reported; more accessible models; pharma making data available  70 hERG models (Villoutreix and Taboureau 2015)  19 protein binding models (Lambrinidis et al 2015)  40 BBB models up to 2009
  6. 6. Model resources for ADME/Tox
  7. 7. CYP 1A2 / 2C9 / 2C19. Substrate (µM): phenacetin (10) / diclofenac (10) / omeprazole (0.5). Inhibitor: naphthoflavone / sulfaphenazole / tranylcypromine. JSF-2019 IC50 (µM): 2.25 / 3.55 / 10.8. Retinal dehydrogenase 1: ADME SARfari predicts importance of CYP1A2, CYP2C9, CYP2C19. The Naïve Bayes model was built with 142,345 compounds (training and validation) and features 135 learned classes. Testing by Dr. Joel Freundlich
  8. 8. Just a matter of scale? Drug Discovery’s definition of Big data Everyone else’s definition of Big data
  9. 9. Where can we get the datasets? • Data sources • PubChem • ChEMBL • ToxCast: over 1800 molecules tested against over 800 endpoints
  10. 10. Open source – but much smaller: 400 diverse, drug-like molecules active against neglected diseases. The 400 cpds were chosen from around 20,000 hits generated by screening campaigns of ~four million compounds from the libraries of St. Jude Children's Research Hospital, TN, USA, Novartis and GSK. Many screens completed
  11. 11. Bigger datasets and model collections • Profiling “big datasets” is going to be the norm. • A recent study mined PubChem datasets for compounds that have rat in vivo acute toxicity data • This could be extended to other big data initiatives like ToxCast (> 1000 compounds x 800 assays) and Tox21 etc. • Kinase screening data (1000s mols x 100s assays) • GPCR datasets etc (1000s mols x 100s assays) Zhang J, Hsieh JH, Zhu H (2014) Profiling Animal Toxicants by Automatically Mining Public Bioassay Data: A Big Data Approach for Computational Toxicology. PLoS ONE 9(6): e99863. doi:10.1371/journal.pone.0099863
  12. 12. ‘Bigger’ and not ‘Big’
  13. 13. Are bigger models better for tuberculosis? [Figure: sizes of the datasets and subsets compared, ranging from 1,248 to 1,771,924 molecules] Ekins et al., J Chem Inf Model 54: 2157-2165 (2014)
  14. 14. No relationship between internal or external ROC and the number of molecules in the training set? PCA of combined data (~350,000 molecules) and ARRA (red). Internal and leave-out-50% x 100 ROC track each other; external ROC correlates less. Smaller models do just as well with external testing. Ekins et al., J Chem Inf Model 54: 2157-2165 (2014)
  15. 15. The Opportunity •Get pharmas to use open source molecular descriptors and algorithms •Benefit from initial work done by Pfizer/CDD •Avoid repetition of open source tools vs commercial tools comparisons •Change the mindset from real data to virtual data – confirm predictions •ADME/Tox is precompetitive •Expand the chemical space and predictivity of models •Share models with collaborators – Companies could share data as models Ekins and Williams, Lab On A Chip, 10: 13-22, 2010. Gupta RR, et al., Drug Metab Dispos, 38: 2083-2090, 2010
  16. 16. Pfizer Open models and descriptors Gupta RR, et al., Drug Metab Dispos, 38: 2083-2090, 2010 • What can be developed with very large training and test sets? • HLM training 50,000 testing 25,000 molecules • training 194,000 and testing 39,000 • MDCK training 25,000 testing 25,000 • MDR training 25,000 testing 18,400 • Open molecular descriptors / models vs commercial descriptors
  17. 17. • Examples – Metabolic Stability Gupta RR, et al., Drug Metab Dispos, 38: 2083-2090, 2010
      HLM Model with CDK and SMARTS Keys: 578 descriptors; 193,650 training set compounds; cross validation on 38,730 compounds; training R2 0.79; 20% test set R2 0.69; blind data set (2310 compounds) R2 = 0.53, RMSE = 0.367; continuous → categorical: κ = 0.40, sensitivity = 0.16, specificity = 0.99, PPV = 0.80; time: 0.252 sec/compound
      HLM Model with MOE2D and SMARTS Keys: 818 descriptors; 193,930 training set compounds; cross validation on 38,786 compounds; training R2 0.77; 20% test set R2 0.69; blind data set (2310 compounds) R2 = 0.53, RMSE = 0.367; continuous → categorical: κ = 0.42, sensitivity = 0.24, specificity = 0.987, PPV = 0.823; time: 0.303 sec/compound
      PCA of training (red) and test (blue) compounds: overlap in chemistry space
  18. 18. • Examples – P-gp Gupta RR, et al., Drug Metab Dispos, 38: 2083-2090, 2010 Open source descriptors (CDK) and C5.0 algorithm; ~60,000 molecules with P-gp efflux data from Pfizer. Training set: MDR <2.5 (low risk) (N = 14,175), MDR >2.5 (high risk) (N = 10,820). Test set: MDR <2.5 (N = 10,441), >2.5 (N = 7,972). Could facilitate model sharing?
      Metric        CDK + fragment descriptors   MOE 2D + fragment descriptors
      Kappa         0.65                         0.67
      Sensitivity   0.86                         0.86
      Specificity   0.78                         0.80
      PPV           0.84                         0.84
  19. 19. Models reside in papers and are not accessible… this is undesirable. How do we share them? How do we use them?
  20. 20. Open Extended-Connectivity Fingerprints ECFP_6 FCFP_6 • Collected, deduplicated, hashed • Sparse integers • Invented for Pipeline Pilot: public method, proprietary details • Often used with Bayesian models: many published papers • Built a new implementation: open source, Java, CDK – stable: fingerprints don't change with each new toolkit release – well defined: easy to document precise steps – easy to port: already migrated to iOS (Objective-C) for TB Mobile app • Provides the core basis feature for the CDD open source model service Clark et al., J Cheminform 6:38 2014
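The hashing-and-folding step behind these fingerprints can be sketched in a few lines. This is an illustrative sketch only, not the CDK implementation; `fold_fingerprint` is a hypothetical name:

```python
def fold_fingerprint(sparse_codes, nbits=1024):
    """Fold sparse hashed fingerprint codes (arbitrary integers) into a
    fixed-length bit vector; colliding codes simply set the same bit."""
    bits = [0] * nbits
    for code in sparse_codes:
        bits[code % nbits] = 1
    return bits

# 7, 1031 and 2055 all collide onto bit 7 when folded to 1024 bits
fp = fold_fingerprint([7, 1031, 2055, 900001], nbits=1024)
```

Keeping the raw sparse integers (as the open implementation does) and folding only when a fixed-width vector is needed is what makes it possible to trade fold size against information loss later.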
  21. 21. Uses Bayesian algorithm and FCFP_6 fingerprints Bayesian models Clark et al., J Cheminform 6:38 2014
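A minimal sketch of the Laplacian-corrected naive Bayes scheme commonly paired with FCFP_6 fingerprints (the scheme popularized by Pipeline Pilot); function names and data layout here are illustrative, not the CDD/CDK code:

```python
import math

def laplacian_bayes_weights(actives, inactives):
    """Per-bit Laplacian-corrected Bayesian weights.
    actives / inactives: lists of sets of 'on' fingerprint bits."""
    n_act, n_tot = len(actives), len(actives) + len(inactives)
    base = n_act / n_tot                     # prior P(active)
    counts = {}                              # bit -> [hits in actives, hits total]
    for fp in actives:
        for b in fp:
            c = counts.setdefault(b, [0, 0]); c[0] += 1; c[1] += 1
    for fp in inactives:
        for b in fp:
            c = counts.setdefault(b, [0, 0]); c[1] += 1
    # Laplacian correction: log((A + 1) / (T * base + 1)) per bit
    return {b: math.log((a + 1) / (t * base + 1)) for b, (a, t) in counts.items()}

def score(fp, weights):
    """Sum the per-bit log weights over a molecule's 'on' bits."""
    return sum(weights.get(b, 0.0) for b in fp)
```

Higher scores indicate the molecule shares more structural features with the actives than the inactives in the training set.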
  22. 22. Exporting models from CDD Clark et al., JCIM 55: 1231-1245 (2015)
  23. 23. Machine Learning – Different tools • Models generated using: molecular function class fingerprints of maximum diameter 6 (FCFP_6), AlogP, molecular weight, number of rotatable bonds, number of rings, number of aromatic rings, number of hydrogen bond acceptors, number of hydrogen bond donors, and molecular fractional polar surface area. • Models were validated using five-fold cross validation (leave out 20% of the database). • Bayesian, Support Vector Machine, and Recursive Partitioning (Forest and Single Tree) models were built. • RP Forest and RP Single Tree models used the standard protocol in Discovery Studio. • Five-fold cross validation or leave-out-50% x 100-fold cross validation was used to calculate the ROC for the models generated • *fingerprints only Ai et al., ADDR 86: 46-60, 2015 KCNQ1
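The five-fold (leave-out-20%) validation described above can be sketched generically. This is a sketch under assumed interfaces: `fit`/`predict` stand in for whichever learner is used, and `roc_auc` uses the Mann-Whitney rank-sum formulation:

```python
import random

def roc_auc(scores, labels):
    """ROC AUC via the Mann-Whitney rank-sum formulation:
    fraction of (active, inactive) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def five_fold_roc(data, labels, fit, predict, seed=0):
    """Leave out 20% five times; report the mean ROC on held-out folds."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    aucs = []
    for k in range(5):
        held = idx[k::5]
        held_set = set(held)
        train = [i for i in idx if i not in held_set]
        model = fit([data[i] for i in train], [labels[i] for i in train])
        ys = [labels[i] for i in held]
        if len(set(ys)) < 2:
            continue  # ROC is undefined when a fold holds only one class
        aucs.append(roc_auc([predict(model, data[i]) for i in held], ys))
    return sum(aucs) / len(aucs)
```

The leave-out-50% x 100 variant mentioned on the slide follows the same pattern with different split sizes and more repetitions.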
  24. 24. Ames Bayesian model built with 6512 molecules (Hansen et al., 2009) Features important for Ames actives. Features important for Ames inactives.
  25. 25. Ames Bayesian model built using CDD Models showing ROC for 3 fold cross validation. Note only FCFP_6 descriptors were used
  26. 26. FCFP6 fingerprint models in CDD Clark et al., JCIM 55: 1231-1245 (2015)
  27. 27. ECFP6 fingerprint only models in MMDS Clark et al., JCIM 55: 1231-1245 (2015)
  28. 28. Using AZ-ChEMBL data for CDD Models
  29. 29. • Human microsomal intrinsic clearance • Rat hepatocyte intrinsic clearance
  30. 30. What if the models were already built for you? • Instead of having to go into a database and find data • The models are already prebuilt • Ready to use • Shareable • Create a repository of models
  31. 31. Previous work by others • Using large datasets to predict targets with a Bayesian algorithm • Bayesian classifier - 698 target models (> 200,000 molecules, 561,000 measurements) Paolini et al 2006 • 246 targets (65,241 molecules) Similarity ensemble analysis Keiser et al 2007 • 2000 targets (167,000 molecules) target identification from zebrafish screen Laggner et al 2012 • 70 targets (100,269 data points) Bender et al 2007 • Many others….. • None of these enables you to qualitatively or quantitatively predict activity for a single target.
  32. 32. Recent Studies • Bit folding – trade off between performance & efficacy • Model cut-off selection for cross validation • Scalability of ECFP6 and FCFP6 using ChEMBL 20 mid-size datasets • CDK codebase on GitHub (http://github.com/cdk/cdk: look for class org.openscience.cdk.fingerprint.model.Bayesian) • Made the models accessible http://molsync.com/bayesian2 Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
  33. 33. What do 2000 ChEMBL models look like? [Plot: average ROC vs folding bit size] http://molsync.com/bayesian2 Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
  34. 34. ChEMBL 20 • Skipped targets with > 100,000 assays and sets with < 100 measurements • Converted data to –log • Dealt with duplicates • 2152 datasets • Cutoff determination • Balance active/ inactive ratio • Favor structural diversity and activity distribution Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
  35. 35. Desirability score • ROC integral for model using subset of molecules and threshold for partitioning active / inactive (higher is better) • Second derivative of population interpolated from the current threshold (lower is better) • Ratio of actives to inactives if the collection partitioned (actives+1) / (inactives+1) or reciprocal..whichever greater Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60 Ekins et al Drug Metab Dispos 43(10):1642-5, 2015
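The active/inactive balance component of this desirability score can be illustrated directly from its stated definition (a sketch only; the combination with the other two terms in the paper may be weighted differently):

```python
def balance_term(n_active, n_inactive):
    """(actives+1)/(inactives+1) or its reciprocal, whichever is greater.
    1.0 means a perfectly balanced active/inactive split; larger values
    penalize thresholds that leave the classes badly imbalanced."""
    r = (n_active + 1) / (n_inactive + 1)
    return max(r, 1.0 / r)
```

A threshold-selection loop would then prefer cut-offs that keep this term low while keeping the ROC term high.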
  36. 36. Models from ChEMBL data http://molsync.com/bayesian2 Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
  37. 37. Results • Bit folding – plateau at 4096; can use 1024 with little degradation • Cut-off determination works well • Evaluated balanced training:test splits and 'diabolical' splits where test and training sets are structurally different Easy ROC 0.83 ± 0.11 Hard ROC 0.39 ± 0.23 Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
  38. 38. Models in mobile app • Added atom coloring using ECFP6 fingerprints • Red and green high and low probability of activity, respectively Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
  39. 39. Results for Bayesian model cross validation. 5-fold and Leave one out (LOO) validation with Bayesian models generated with Discovery Studio and Open Models implemented in the mobile app MMDS. * = previously published Ekins et al Drug Metab Dispos 43(10):1642-5, 2015 Transporter models
  40. 40. Ekins et al Drug Metab Dispos 43(10):1642-5, 2015 Transporter models
  41. 41. ToxCast data • Few studies use the ToxCast data for machine learning • Recent reviews: Sipes et al., Chem Res Toxicol. 2013 Jun 17;26(6):878-895 • Liu et al., Chem Res Toxicol. 2015 Apr 20;28(4):738-51 • A set of 677 chemicals was represented by 711 in vitro bioactivity descriptors (from ToxCast assays), 4,376 chemical structure descriptors (from QikProp, OpenBabel, PaDEL, and PubChem), and three hepatotoxicity categories (from animal studies) • Six machine learning algorithms: linear discriminant analysis (LDA), Naïve Bayes (NB), support vector machines (SVM), classification and regression trees (CART), k-nearest neighbors (KNN), and an ensemble of these classifiers (ENSMB) • Nuclear receptor activation and mitochondrial functions were frequently found in highly predictive classifiers of hepatotoxicity • CART, ENSMB, and SVM classifiers performed the best
  42. 42. CDD Models for human P450s (NVS data) from ToxCast (n=1787) <1uM cutoff CYP1A1 CYP1A2 CYP2B6 CYP2C18 CYP2C19 CYP2C9 CYP3A4 CYP3A5
  43. 43. ToxCast models in a mobile app IC50 1A2 = 2.25 uM IC50 2C9 = 3.55 uM IC50 2C19 = 10.8 uM In vitro data Courtesy Dr. Joel Freundlich
  44. 44. PolyPharma a new free app for drug discovery
  45. 45. Composite models - Binned Bayesians Clark et al., Submitted 2015
  46. 46. Summary • Shown that open source models/descriptors are comparable to previously published models built with commercial software • Implemented Bayesian machine learning in CDD Vault • Can be used on private or public data • Can enable sharing of models in CDD Vault • Enabled export of models – can use models in 3rd-party mobile apps or other tools • Demonstrated various ADME/Tox and transporter models • Made ToxCast data into models that can be used by anyone • Provide more information on models and predictions • Visualize training set molecules vs test compounds • Use a model to predict compounds and then test them
  47. 47. Future? Big Models, Thousands of Big Models. How do you validate 1000s of models? How do algorithms handle 500K – 1M molecules? Need new algorithms, data visualization, mining approaches. Model sharing is here. Need for broad biology & chemistry knowledge – open minds, BIG thinkers
  48. 48. Acknowledgments • Alex Clark Antony Williams • Joel Freundlich Robert Reynolds • Steven Wright • Krishna Dole and all colleagues at CDD • Award Number 9R44TR000942-02 “Biocomputation across distributed private datasets to enhance drug discovery” from the NIH National Center for Advancing Translational Sciences. • R41-AI108003-01 “Identification and validation of targets of phenotypic high throughput screening” from NIH National Institute of Allergy and Infectious Diseases • Bill and Melinda Gates Foundation (Grant#49852 “Collaborative drug discovery for TB through a novel database of SAR data optimized to promote data archiving and sharing”).
  49. 49. Software on github Models can be accessed at • http://molsync.com/bayesian1 • http://molsync.com/bayesian2 • http://molsync.com/transporters
