Data for AI models, the past, the present, the future

Prof. John Overington, CIO of the Medicines Discovery Catapult, describes the AssayNet project and its far-reaching implications.

Published in: Health & Medicine

  1. Data for AI Models: The Past, The Present, The Future. John P. Overington, jpo@md.catapult.org.uk
  2. © 2019 Medicines Discovery Catapult. All rights reserved. “Public data is the worst form of training data for AI except for all those other forms that have been tried from time to time” Winston Churchill, 2016
  3. MDC - Medicines Discovery Catapult: national facility connecting the UK community to accelerate innovative drug discovery
     • Independent not-for-profit organisation
     • Part of the UK’s Catapult network
     • Helping to deliver the UK’s Industrial Strategy
     • Funded by Innovate UK, part of UK Research and Innovation, reporting to the Department for Business, Energy & Industrial Strategy
     • Focus on SME and translational academic sector support
  4. ChEMBL, SureChEMBL & UniChem
  5. ChEMBL – www.ebi.ac.uk/chembl
     • Originally developed in 2003 at Inpharmatica
     • Spun out to public domain
     • The world’s largest primary public database of medicinal chemistry data
     • ~2.3 million compounds
     • ~11,000 targets
     • ~15 million bioactivities
     • Truly Open Data - CC-BY-SA license
     • API, MyChEMBL VM, RDF, full tables download…
     • Basis of vast majority of AI innovation in compound design/optimisation
     Gaulton et al (2012) Nucleic Acids Research Database Issue. 40 D1100-1107
  6. ChEMBL. Figure: a ChEMBL record links a compound to a target (>Thrombin, shown as its protein sequence) through assays such as “Inhibition of human Thrombin” and “PTT (partial thromboplastin time)”, with measured values Ki = 4.5 nM and ED2 = 230 nM.
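
The record on this slide reports potency on a concentration scale (Ki = 4.5 nM), while the variability slides later in the deck compare potencies as pKi. A minimal sketch of that conversion, assuming the usual convention pKi = −log10(Ki in mol/L); the function name `p_activity` is illustrative and not part of any ChEMBL API:

```python
import math

def p_activity(value_nm: float) -> float:
    """Convert a concentration-type activity given in nM (Ki, IC50, ...)
    to its negative log10 molar value (pKi, pIC50, ...)."""
    if value_nm <= 0:
        raise ValueError("activity must be a positive concentration")
    # 1 nM = 1e-9 mol/L, so -log10(x * 1e-9) = 9 - log10(x)
    return 9.0 - math.log10(value_nm)

# Ki = 4.5 nM from the thrombin example on the slide
print(round(p_activity(4.5), 2))  # 8.35
```

Working on the log scale makes a 10-fold change in Ki correspond to one pKi unit, which is the scale used for the potency-difference plots.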
  7. SureChEMBL – www.surechembl.org
     • Public chemistry patent resource
     • Donated by Digital Science – SureChem commercial product
     • Automatically extracted chemical structures from full-text patents
     • >18 million chemical structures
     • Updated daily
     • Full chemistry data download
     Papadatos et al (2016) Nucl. Acids Res Database Issue D1220-1228
  8. UniChem – www.ebi.ac.uk/unichem
     • Simple chemical integration service
     • >144 million structures from ~30 sources
     • URI/resource ID/Standard InChI based lookups
     • Available chemicals, PubChem, ZINC, real time, private
     • Chemical structure ‘Time Machine’
     Chambers et al (2013) J. Cheminf. DOI:10.1186/1758-2946-5-3
  9. Personal Perspectives on ChEMBL
     • Things that worked well
       • Single, major visionary funder – Wellcome Trust
       • Focus on data content/backend not GUI
       • Clear License – CC-BY-SA - same license as Wikipedia content
       • Private/secure services
       • Opportunism – SureChEMBL
       • Open Data in ChEMBL re-invigorated cheminformatics research
     • Things that didn’t work so well
       • Community curation attempts – armchair critics
       • Publisher interactions – except Royal Society of Chemistry
     • I would do things very differently now
  10. The Reproducibility Reproducibility Crisis! Begley & Ellis (2012) Nature DOI:10.1038/483531a & Prinz et al (2011) NRDD DOI:10.1038/nrd3439-c1
  11. Errors in ChEMBL. “The more complex the parameter, the more frequent the errors”. Enhanced data model for ChEMBL can appear as ‘errors’: e.g. complexes, receptor sets, model organisms. Tiikkainen et al (2013) JCIM DOI:10.1021/ci400099q
  12. Errors in SureChEMBL. Senger et al (2015) J Cheminf DOI:10.1186/s13321-015-0097-z
  13. Inter-species Assay Variability: same compound, same end-point for rat and human orthologs (n = 2,781). Figure panels: scatter plot of measured potencies (pKi human vs pKi rat) and distribution of potency differences, diff(human, rat), as normalised density. Krüger & Overington (2012) PLoS Comp. Biol. DOI:10.1371/journal.pcbi.1002333
  14. Inter-lab Assay Variability: same compound, same species, different publication (n = 3,000). Figure panels: scatter plot of measured potencies (pKi Assay 1 vs pKi Assay 2) and distribution of potency differences, diff(assay1, assay2), as normalised density. Krüger & Overington (2012) PLoS Comp. Biol. DOI:10.1371/journal.pcbi.1002333
  15. Inter-species vs Inter-lab Variability. Figure: overlaid density distributions of potency differences (pKi_i − pKi_j) for inter-laboratory and inter-species pairs. Krüger & Overington (2012) PLoS Comp. Biol. DOI:10.1371/journal.pcbi.1002333
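
The variability slides rest on pairing repeated measurements of the same compound against the same target and looking at the distribution of their differences. A minimal sketch of that pairing step, assuming records of the form (compound, target, publication, pKi); the record layout and the toy values are illustrative assumptions and are not taken from the cited paper:

```python
from collections import defaultdict
from itertools import combinations
from statistics import pstdev

def pairwise_differences(records):
    """Return pKi differences for every pair of measurements of the same
    compound on the same target that come from different publications."""
    groups = defaultdict(list)
    for compound, target, publication, pki in records:
        groups[(compound, target)].append((publication, pki))
    diffs = []
    for measurements in groups.values():
        for (pub_a, pki_a), (pub_b, pki_b) in combinations(measurements, 2):
            if pub_a != pub_b:  # skip replicates within one publication
                diffs.append(pki_a - pki_b)
    return diffs

# Toy records (hypothetical values)
records = [
    ("cpd1", "thrombin", "paperA", 8.3),
    ("cpd1", "thrombin", "paperB", 7.9),
    ("cpd2", "thrombin", "paperA", 6.0),
    ("cpd2", "thrombin", "paperC", 6.5),
    ("cpd3", "thrombin", "paperA", 5.0),  # measured once, contributes no pair
]
diffs = pairwise_differences(records)
spread = pstdev(diffs)  # one simple summary of the difference distribution
```

The same function could be reused for the inter-species comparison by putting the species in the third field instead of the publication.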
  16. Large-Scale Cell-line Screening Data. Garnett et al (2012) Nature DOI:10.1038/nature11005 & Barretina et al (2012) Nature DOI:10.1038/nature11003
  17. Inconsistent Cell-line Screening Data. Haibe-Kains et al (2013) Nature DOI:10.1038/nature12831 (see also Stransky et al (2015) Nature DOI:10.1038/nature15736)
  18. Primary Data – Batches and Replicates. http://www.wexlerwallace.com/wp-content/uploads/2012/04/Southeast-Laborers-Health-v-Pfizer.pdf
  19. Incorrect Chemical Structures: Bosutinib, Voxtalisib. http://cen.acs.org/articles/90/web/2012/05/Bosutinib-Buyer-Beware.html; Overington & Wennerberg, unpublished
  20. Variance – From Simple to Complex. Figure: assay types ordered as biochemical assay, cell-based screen, functional assay, animal disease model, human clinical trial; labels: number of assay variables, inter-study variance, steady state, time dependent.
  21. The Present
  22. MDC Collaborating With The Sector
  23. DeepADMET
     • DeepADMET – InnovateUK grant
       • Optibrium Ltd.
       • Intellegens Ltd.
       • Medicines Discovery Catapult
     • MDC engineering software pipeline to supply ‘SAR data on demand’
       • Flexible wrt document source
       • Fast and responsive
       • Significantly boost public/internal data
       • Deliver provenanced activity ‘vectors’
     • Develop broader range of robust ADMET models using deep learning
     Pipeline: document gathering → NLP / NER → data extraction & heuristics → SAR vectors
  24. DeepADMET – Data Structure. Figure labels: Compound, Assay conditions; two data types: primary (preferred; measured in the same assay) and secondary (compiled from literature review, databases).
  25. The Future
  26. Neural Networks. https://stevenmiller888.github.io/mind-how-to-build-a-neural-network/
  27. Assays in Drug Discovery. Assay types and their test systems: biochemical assays (proteins), cell-based assays (cell lines), functional assays (tissues & organs), in vivo assays (animal models), human studies (humans).
     Ancient: “Human clinical trial”
     • Error prone, serendipitous discoveries
     • Traditional medicines: aspirin, quinine, …
  28. Assays in Drug Discovery (continued). 1910s: animal in vivo assays
     • Faster, safer, cheaper
     • … but less predictive
  29. Assays in Drug Discovery (continued). 1920s: ex vivo assays
     • Higher throughput, cheaper
     • Mechanistic insights
     • … but less predictive
  30. Assays in Drug Discovery (continued). 1950s: cell-based assays
     • Higher throughput, cheaper
     • Mechanistic insights
     • … but less predictive
  31. Assays in Drug Discovery (continued). 1970s: biochemical assays
     • Higher throughput
     • Mechanistic insights
     • Recombinant DNA technology
     • … but less predictive
  32. Example Assay Path: Anti-inflammatory Drugs. Prostaglandin G/H synthase 2 → LPS-stimulated THP-1 cells → LPS-stimulated human whole blood → carrageenan-injected rat → acute gout patient
  34. AssayNet – Building the Network
     • Finding assays: text-mining across papers, patents, vendor catalogues
     • Indexing of assays: specialist dictionaries - techniques, equipment, genes, end-points, …
     • Classification of assays: efficacy/ADMET & biochemical, cell-based, organoid, tissue, …
     • Similarity of assays: how ‘similar’ are two assays?
     • Chaining of assays: constructing the directed graph
     • Learning thresholds: identification of ‘triggers’ from chained, directed assay pairs
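
The “chaining of assays” step can be pictured as turning undirected assay pairs into directed edges. The sketch below is one possible reading and not the AssayNet implementation: it assumes each assay already has a class label, that pairs come with a count of shared compounds, and that direction follows the biochemical → human ordering used elsewhere in the deck. `STAGE`, `chain_assays` and the counts are hypothetical.

```python
# Assumed ordering of assay classes, from simplest to most complex system
STAGE = {"biochemical": 0, "cell-based": 1, "functional": 2, "in vivo": 3, "human": 4}

def chain_assays(assay_class, co_tested, min_shared=1):
    """Build directed edges (earlier assay -> later assay).

    assay_class: dict mapping assay name -> class label (a key of STAGE)
    co_tested:   iterable of (assay_a, assay_b, n_shared_compounds)
    """
    edges = []
    for assay_a, assay_b, n_shared in co_tested:
        if n_shared < min_shared:
            continue  # too little evidence that the assays are linked
        rank_a = STAGE[assay_class[assay_a]]
        rank_b = STAGE[assay_class[assay_b]]
        if rank_a < rank_b:
            edges.append((assay_a, assay_b))
        elif rank_b < rank_a:
            edges.append((assay_b, assay_a))
        # assays of the same class get no directed edge in this sketch
    return edges

# Assay names from the anti-inflammatory example path; counts are invented
assay_class = {
    "Prostaglandin G/H synthase 2": "biochemical",
    "LPS-stimulated THP-1 cells": "cell-based",
    "carrageenan-injected rat": "in vivo",
}
co_tested = [
    ("LPS-stimulated THP-1 cells", "Prostaglandin G/H synthase 2", 12),
    ("LPS-stimulated THP-1 cells", "carrageenan-injected rat", 3),
    ("Prostaglandin G/H synthase 2", "carrageenan-injected rat", 0),
]
edges = chain_assays(assay_class, co_tested)
```

With these toy inputs the enzyme assay points to the cell assay and the cell assay points to the rat model; the pair with no shared compounds is dropped.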
  39. Learning Decision Thresholds
     • Decision thresholds: what activity threshold in Assay 1 makes it worth measuring in Assay 2?
     • Learn from statistical distributions
     • Probably artefactually thresholded at integral pIC50 thresholds – e.g. 1mM (cf P-value distributions)
     Figure: distribution of activity values (pIC50) of compounds in Assay 1 and of the compounds selected for screening in Assay 2, comparing a sharp cutoff with a sampled cutoff.
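
For the “sharp cutoff” case, learning a decision threshold can be sketched as finding the Assay 1 pIC50 value that best separates compounds that were later measured in Assay 2 from those that were not. This is an illustrative assumption about how such a threshold might be estimated, not the method used in the project; `learn_threshold` and the data are hypothetical.

```python
def learn_threshold(assay1_pic50, progressed):
    """Return the pIC50 cutoff that classifies the most compounds correctly
    under the rule 'progress to Assay 2 if pIC50 >= cutoff'.

    assay1_pic50: list of pIC50 values measured in Assay 1
    progressed:   list of bools, True if the compound was also tested in Assay 2
    """
    best_cutoff, best_correct = None, -1
    for cutoff in sorted(set(assay1_pic50)):
        correct = sum(
            (value >= cutoff) == flag
            for value, flag in zip(assay1_pic50, progressed)
        )
        if correct > best_correct:  # ties keep the lowest cutoff
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff

# Toy data (hypothetical): compounds at or above 6.1 were progressed
values = [4.5, 5.2, 5.9, 6.1, 6.8, 7.4]
flags = [False, False, False, True, True, True]
print(learn_threshold(values, flags))  # 6.1
```

A “sampled cutoff”, where progression becomes more likely with potency rather than switching on at one value, would need a probabilistic model instead of a single threshold.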
  40. Bayesian Networks
  41. Bioassay data - ChEMBL Database. Figure: compound, assay and target (>Thrombin, shown as its protein sequence) linked by bioassay data, e.g. IC50 4.5 nM, APTT 11 min.
     • Data manually extracted by a team of curators from published pharmacology and drug discovery literature (e.g. Journal of Medicinal Chemistry)
     • ChEMBL has transformed many aspects of cheminformatics research
       − Target prediction
       − Large-scale QSAR
       − Matched Molecular Pairs
       − …
     • ChEMBL is the foundation data source of almost all published AI compound design research
  42. ChEMBL as a Graph. Figure: compounds (a–h) are linked to assays (1–6) by “has activity in” edges; from this graph an assay-assay network and a compound-compound network are derived. Zwierzyna & Overington (in preparation)
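
One common way to derive an assay-assay network from compound-assay links is a bipartite projection: two assays are connected when at least one compound has activity in both. The sketch below assumes that construction, which the slide does not spell out; the edge list is a toy example and not the one drawn in the figure.

```python
from collections import defaultdict
from itertools import combinations

def assay_network(activities):
    """Project (compound, assay) links onto an assay-assay network.

    Returns a dict mapping (assay_a, assay_b) -> number of shared compounds.
    """
    compounds_by_assay = defaultdict(set)
    for compound, assay in activities:
        compounds_by_assay[assay].add(compound)
    edges = {}
    for assay_a, assay_b in combinations(sorted(compounds_by_assay), 2):
        shared = compounds_by_assay[assay_a] & compounds_by_assay[assay_b]
        if shared:
            edges[(assay_a, assay_b)] = len(shared)
    return edges

# Toy "has activity in" links: compounds are letters, assays are numbers
activities = [("a", 1), ("b", 1), ("b", 2), ("c", 2), ("c", 3), ("a", 3)]
network = assay_network(activities)
```

Swapping the order of each pair, so that assays play the role of compounds, gives the compound-compound network with the same function.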
  43. Assay Network: Binding Assay Data (Subset). A subset of the assay network (~6,000 nodes) constructed using protein-binding assay data from ChEMBL. Zwierzyna & Overington (in preparation)
  44. Assay Network: Preclinical Assay Data. Figure legend: PPAR binding assay, DPP-4 binding assay, in vivo assay, cell-based assay.
     • Fragment of the assay network with a subset of bioassays testing antidiabetic compounds
     • Assays involving closely related biological targets are clustered together, e.g. assays involving various peroxisome proliferator-activated receptors in the green cluster
     • Antidiabetic compounds with different mechanisms of action (e.g. DPP-4 inhibitors and PPAR agonists) are often tested in the same animal model (such as Zucker diabetic rat) → in vivo assays link distinct clusters
     Zwierzyna & Overington (in preparation)
  45. Animal Models: Assay Descriptions. CHEMBL893931: “Inhibition of carrageenan-induced paw oedema in Sprague-Dawley rat at 5.16 mg/kg, sc after 3 hrs.”
  46. Animal Models: Assay Descriptions. CHEMBL893931: “Inhibition of carrageenan-induced paw oedema in Sprague-Dawley rat at 5.16 mg/kg, sc after 3 hrs.” Annotated fields: induced model, phenotype, genetic strain, dosage, administration route, timing.
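
To make the annotated fields concrete, here is a minimal sketch that pulls them out of the CHEMBL893931 description with a single regular expression. This is a deliberately simple stand-in: the deck itself describes tagging and parsing of descriptions, not regex matching, and the pattern and group names below are assumptions tuned to this one sentence.

```python
import re

# Hypothetical pattern for descriptions shaped like
# "<X>-induced <phenotype> in <strain> at <dose> mg/kg, <route> after <time> hrs"
PATTERN = re.compile(
    r"(?P<inducer>\w+)-induced\s+(?P<phenotype>.+?)\s+in\s+"
    r"(?P<strain>.+?)\s+at\s+"
    r"(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg/kg),\s*"
    r"(?P<route>\w+)\s+after\s+"
    r"(?P<time>\d+(?:\.\d+)?)\s*(?P<time_unit>hrs?|min)"
)

def parse_description(text):
    """Return a dict of extracted fields, or None if the text does not fit."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None

fields = parse_description(
    "Inhibition of carrageenan-induced paw oedema in Sprague-Dawley rat "
    "at 5.16 mg/kg, sc after 3 hrs."
)
# fields["inducer"] == "carrageenan", fields["route"] == "sc", ...
```

A fixed pattern like this fails on descriptions with a different word order, for example CHEMBL1799193 on the next slide, which is one reason to prefer parsing-based extraction.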
  47. Information Extraction From Assay Descriptions. Example, CHEMBL1799193: “Antiallodynic activity in Wistar albino rat chronic constriction injury-induced neuropathic pain model assessed as attenuation of mechanical allodynia.” Figure panels A–E show the description with part-of-speech tags (JJ NN IN NNP NN NN JJ NN JJ JJ NN NN VBN IN NN IN JJ NN), a phrase-structure parse (S = sentence, NP = noun phrase, VP = verb phrase, PP = prepositional phrase), spans labelled Experiment, Phenotype and Strain, and per-token numeric vectors such as [9.11, 8.73, 9.19, ...]. Zwierzyna & Overington (in preparation)
  48. PCA of Word2Vec Assay Descriptions. Each assay description: average over its word vectors. Data points projected from a 200-dimensional space to 2D using PCA. Zwierzyna & Overington, unpublished
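
The averaging step on this slide can be sketched in a few lines. The example assumes a lookup table of pre-trained word vectors is already available; the three-dimensional toy vectors and the function name `embed_description` are illustrative only (the slide refers to 200-dimensional Word2Vec vectors), and the PCA projection is not shown.

```python
def embed_description(description, word_vectors):
    """Represent a description as the mean of the vectors of its known words.

    word_vectors: dict mapping lower-case token -> list of floats (same length)
    Returns None when no token of the description has a vector.
    """
    tokens = [t for t in description.lower().split() if t in word_vectors]
    if not tokens:
        return None
    dim = len(word_vectors[tokens[0]])
    return [
        sum(word_vectors[t][i] for t in tokens) / len(tokens)
        for i in range(dim)
    ]

# Toy 3-dimensional vectors (hypothetical values)
toy_vectors = {"rat": [1.0, 0.0, 2.0], "pain": [3.0, 2.0, 0.0]}
vector = embed_description("Rat pain model", toy_vectors)
print(vector)  # [2.0, 1.0, 1.0]
```

Words without a vector ("model" here) are simply skipped, so descriptions of different lengths still map to vectors of the same dimension.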
  49. Word2vec Embedding of Assays. ChEMBL assays of known drugs annotated with different ATC codes (~15k of ~94k): L01 (antineoplastic), M01 (anti-inflammatory), N03 (antiepileptic), A10 (antidiabetic), C02 (antihypertensive), N02 (analgesic). Zwierzyna & Overington, unpublished
  50. AssayNet – Translational Path From Lab To Clinic. Compound path: biochemical assay → cell-based screen → functional assay → animal disease model → human clinical trial
     • Build assay networks from literature/patent co-occurrence
     • Link to animal models and genetics
     • Understand target engagement/pharmacodynamics through development
     • Directed graph of all assays from targets to clinical trials
  51. Acknowledgements: Bissan Al-Lazikani, Aroon Hingorani, Juan Pablo-Casas, Marc Marti-Renom, Francesco Martinez, Magda Zwierzyna, Mark Davies, Krister Wennerberg, Mark Warren, Gemma Holliday, Andrew Pannifer, Richard Seacome, James Welsh, Matthew Hodsgkiss, Charles Bury, Kepa Brurusco-Goni, Daiel James, Adam Poulston, Matt Cockayne, Baydr Earls, Herve Barjat, Dave Allen, James Peach, Nathan Dedman, George Papadatos, Grace Mugumbate, Anna Gaulton, Prudence Mutowo, Louisa Bellis, Anne Hersey, Jon Chambers, Michal Nowotka, Anneli Karlsson, Ines Smit, Francis Atkinson, Paula Magarinos, Felix Kruger, Rita Santos
