Machine Learning for Preclinical Research
1. MACHINE LEARNING FOR PRECLINICAL RESEARCH
Paul Agapow <p.agapow@imperial.ac.uk>
Data Science Institute, Imperial College London
Adv. Machine Learning & AI for Drug Discovery & Development (Berlin, June 2018)
2. BACKGROUND & DISCLOSURE
➤ Data Science Institute (Imperial College London)
➤ Novel & advanced computation over large rich biomedical datasets for translational research & precision medicine
➤ Patient subtype discovery & mechanistic insight
➤ Scientific Advisor to PangaeaData.ai
3.
➤ Big Data is a problem
➤ Methodology is a problem
➤ Truth is a problem
➤ But maybe we can do something about it
5. BIOMEDICAL BIG DATA IS USUALLY NOT BIG (ENOUGH)
➤ Average trial size on ClinicalTrials.gov < 100
➤ Average #samples per GEO dataset < 100
➤ Average GWAS cohort size ~9000 (median ~2500)
➤ 1,064 ICU admissions for flu in UK 2016/2017 season
VS
➤ Curse of dimensionality
➤ Deep learning requires “thousands” of samples for training (at least p²?)
➤ GWAS needs 3K+ for large effects, 10K or more for small effects …
➤ Sub-populations & rare diseases will be smaller
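The gap between the dataset sizes and the sample requirements on this slide can be made concrete with a little arithmetic. The sketch below takes the slide's tentative "at least p²" figure at face value and assumes p means the number of input features (the slide does not define it). The function names are illustrative, and the 20,000-feature panel is a hypothetical example.

```python
import math

def max_features_for(n_samples: int) -> int:
    """Largest p with p**2 <= n_samples, under the assumed p-squared heuristic."""
    return math.isqrt(n_samples)

def samples_needed_for(n_features: int) -> int:
    """Samples the same heuristic would ask for, given p features."""
    return n_features ** 2

# Cohort sizes quoted on the slide ("< 100" is treated as its upper bound).
cohorts = {
    "ClinicalTrials.gov trial": 100,
    "GEO dataset": 100,
    "GWAS cohort (median)": 2500,
    "GWAS cohort (average)": 9000,
    "UK flu ICU admissions 2016/2017": 1064,
}

for name, n in cohorts.items():
    print(f"{name}: n={n} supports at most p={max_features_for(n)} features")

# Hypothetical high-dimensional panel with 20,000 features.
print(f"p=20000 would ask for n={samples_needed_for(20000):,} samples")
```

Read this way, a typical GEO-sized dataset supports only around ten features, which is why feature reduction or pooling of datasets comes before any deep model.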
6. MAKE BIGGER DATASETS
➤ “Allow” reuse & combining not “build”
➤ FAIR
➤ Use standards like CDISC, HPO …
➤ eTRIKS
➤ Europe’s largest public-private initiative (pharma, academic, SME, other)
➤ Data intensive translational research
➤ Data catalog of ~70 studies
➤ Sharing data (standards, starter kit)
7. WE NEED MORE ETL
➤ Too damn slow and expensive
➤ Tools are poor
➤ Humans are inconsistent
➤ Standards are complex
➤ Harmonisation by ML is the only answer
➤ Learn from data examples
➤ Corrected by humans
➤ “Discover” schema if need be
[Figure: data harmonisation workflow for text data and tabular data, with data extractors feeding pre-classified data and master data mappings]
§ Frequent Pattern Mining-Growth Algorithms to determine schema association rules
§ Word2Vec to condense information of text sequence and context
§ Graph-Theoretical Algorithms to determine logical sequences, followers, associations, matchings
§ Decision Trees, Neural Nets and Support Vector Machines for training the model
§ Custom Algorithms to prepare data and check data quality
From PangaeaData.AI
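The "learn from data examples, corrected by humans" loop can be illustrated in miniature. This is a minimal sketch, not the PangaeaData.AI pipeline: it swaps the learned models for plain string similarity (Python's standard `difflib`) against previously curated examples. The column names, target fields and the 0.6 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

def normalise(name):
    """Lower-case a column name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def suggest_mapping(column, known, threshold=0.6):
    """Suggest a standard field for `column` from curated examples.

    `known` maps raw column names to standard fields. Returns
    (field, score), with field None when nothing is similar enough,
    so that the column is routed to a human curator instead.
    """
    best_field, best_score = None, 0.0
    for example, field in known.items():
        score = SequenceMatcher(None, normalise(column), normalise(example)).ratio()
        if score > best_score:
            best_field, best_score = field, score
    if best_score < threshold:
        return None, best_score
    return best_field, best_score

def record_correction(column, field, known):
    """Store a curator's decision so later suggestions can reuse it."""
    known[column] = field

# Hypothetical curated examples (raw column name -> standard field).
known = {"patient_age": "AGE", "age_years": "AGE", "gender": "SEX", "sex": "SEX"}

print(suggest_mapping("Patient Age", known))      # close to a known example
print(suggest_mapping("xyz_unknown_123", known))  # no confident match
record_correction("DOB", "BIRTH_DATE", known)     # human correction is kept
print(suggest_mapping("dob", known))
```

The design point is the fallback: a low-confidence column is handed to a person, and that person's answer becomes a new training example.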
8. “On Big Data, data collection biases are always larger than statistical uncertainty”
- Daniel Himmelstein
9. THE SIGNAL TO NOISE RATIO IS POOR
➤ Sampling bias
➤ P-hacking
➤ Garden of forking paths
➤ Reversion to mean
➤ Multiple hypothesis testing
➤ False discovery
➤ P-values
➤ Which method is best?
➤ Omnigenics (every gene affects every other gene)
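Multiple hypothesis testing and false discovery are the items on this slide with a standard mechanical remedy. As a minimal sketch, here is the Benjamini-Hochberg procedure for controlling the false discovery rate, written in plain Python. The p-values are made up for illustration and do not come from any study in this talk.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Sort the p-values, find the largest rank k with p_(k) <= (k / m) * alpha,
    and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

# Hypothetical p-values from ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive = [p < 0.05 for p in p_values]
corrected = benjamini_hochberg(p_values, alpha=0.05)
print("uncorrected discoveries:", sum(naive))
print("Benjamini-Hochberg discoveries:", sum(corrected))
```

With these numbers, five tests pass an uncorrected 0.05 threshold but only two survive the correction. It addresses the multiplicity problem only; sampling bias and forking paths need to be handled in study design.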
10. EXAMPLE: U-BIOPRED
➤ Unbiased BIOmarkers in PREDiction of respiratory disease outcomes
➤ 900+ patients, 16 clinical centres + other studies combined via standards
➤ Outputs:
  ➤ Analyses largely on small subsets (~100)
  ➤ Subtyping of asthmatics
  ➤ 40+ academic publications
11.
12. THE REALITY OF DEEP LEARNING
➤ Deep learning is still in progress
➤ Usually insufficient (good labelled) data
➤ Interpretability issues
➤ Legal & ethical issues, federated analysis
➤ Tells you what you’ve told it
➤ Bias towards images
➤ For now …
13. DEEP LEARNING WITH LESS DATA
➤ Pre-training (data without labels)
➤ Initial training with mediocre data
➤ Adapt
  ➤ Transfer learning (labels / output changes)
  ➤ Domain adaptation (data / input changes)
➤ Data augmentation
➤ Interpretability coming slowly (LIME)
Dielman 2015
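Of the techniques listed, data augmentation is the easiest to show in a few lines. The sketch below assumes image-like inputs whose label does not depend on orientation (an assumption that must be checked for each dataset) and generates the eight rotated and mirrored variants of a small grid. It uses plain nested lists so that it runs without any array library.

```python
def rotate90(image):
    """Rotate a 2D grid (list of rows) a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror a 2D grid left to right."""
    return [list(row[::-1]) for row in image]

def augment(image):
    """Return the 8 rotation and mirror variants of one image."""
    variants = []
    current = [list(row) for row in image]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants

# Toy 2x2 "image"; each variant would keep the original label.
image = [[1, 2],
         [3, 4]]
for variant in augment(image):
    print(variant)
```

One labelled example becomes eight training examples. The same idea extends to crops, noise and colour shifts, as long as the transformation cannot change the true label.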
14. “80% of the time, you can get 80% of the way with a simple decision tree.”
- Doug McIlwraith (paraphrased)
15. EXAMPLE: TEXT CLASSIFICATION FOR SYSTEMATIC REVIEWS
➤ Aim: find similar or related publications within corpus
➤ Actual aim: find which method of text classification is “best” (Validation)
➤ Data: 15 Drug Control Reviews & Neuropathic Pain dataset
➤ Classify with random forest, naive Bayes, SVM & CNNs
Conclusion
Dataset                     WSS    Classifier
ACE Inhibitors              0.26   SVM
ADHD                        0.35   MNB
Antihistamines              0.19   MNB
Atypical Antipsychotics     0.12   SVM
Beta Blockers               0.13   SVM
CCB                         0.21   SVM
Estrogen                    0.25   SVM
Neuropathic Pain            0.61   CNN
NSAIDS                      0.14   SVM
Opioids                     0.23   SVM
Oral Hypoglycemics          0.21   SVM
PPI                         0.17   SVM
Skeletal Muscle Relaxants   0.21   SVM
Statins                     0.19   SVM
Triptans                    0.22   SVM
Urinary Incontinence        0.25   SVM
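The slide does not define WSS. In the systematic-review screening literature it usually stands for Work Saved over Sampling, and the sketch below assumes that is the metric meant here: WSS = (TN + FN) / N - (1 - recall), the fraction of reading saved compared with screening a random sample that reaches the same recall. The recall level behind the table's figures is not stated, so the function simply uses whatever recall the given predictions achieve. The example data are hypothetical.

```python
def work_saved_over_sampling(relevant, flagged):
    """WSS = (TN + FN) / N - (1 - recall) for one screening run.

    relevant[i] is True if document i truly belongs in the review;
    flagged[i] is True if the classifier sends it for manual reading.
    """
    if len(relevant) != len(flagged):
        raise ValueError("inputs must have the same length")
    tp = sum(1 for r, f in zip(relevant, flagged) if r and f)
    fn = sum(1 for r, f in zip(relevant, flagged) if r and not f)
    tn = sum(1 for r, f in zip(relevant, flagged) if not r and not f)
    if tp + fn == 0:
        raise ValueError("need at least one relevant document")
    recall = tp / (tp + fn)
    return (tn + fn) / len(relevant) - (1.0 - recall)

# Hypothetical run: 20 abstracts, 4 relevant; the classifier flags 8 for
# reading, including all 4 relevant ones.
relevant = [True] * 4 + [False] * 16
flagged = [True] * 8 + [False] * 12
print(work_saved_over_sampling(relevant, flagged))
```

A classifier that flags every document scores 0 (no work saved), so the values in the table are best read as the share of the screening load removed at the achieved recall.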
16. EXAMPLE: ASTHMA ENDOTYPING
➤ Asthma is highly heterogeneous
➤ Symptoms
➤ Response to interventions
➤ Multiple mechanisms
➤ 3 or 4 or 7 clusters …
➤ Carefully curated data from U-BIOPRED (~100)
➤ Analyse “smart”: use appropriate analysis
Wiki Commons
17. MULTI- OR INTEGRATED OMICS
➤ Why?
  ➤ One way to get more data
  ➤ Statistical power
  ➤ Multiple defects required to drive endogenous disease
  ➤ Multiple “views” on condition
➤ How?
  ➤ Cluster / network individual data layers
  ➤ Fuse together for consensus
Nemutlu 2012
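The "cluster each layer, then fuse for consensus" recipe can be sketched with a co-association matrix: count how often two patients land in the same cluster across layers, then group patients who co-cluster in most layers. This is a deliberately simple stand-in for the fusion methods used in practice. The layer names and cluster labels below are hypothetical.

```python
def co_association(labelings):
    """Fraction of layers in which each pair of samples shares a cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]

def consensus_clusters(matrix, threshold=0.5):
    """Group samples linked by co-association above `threshold`."""
    unassigned = set(range(len(matrix)))
    clusters = []
    while unassigned:
        stack, members = [min(unassigned)], set()
        while stack:
            i = stack.pop()
            if i in members:
                continue
            members.add(i)
            stack.extend(j for j in unassigned
                         if j not in members and matrix[i][j] > threshold)
        unassigned -= members
        clusters.append(sorted(members))
    return clusters

# Hypothetical cluster labels for 6 patients in three omics layers.
# Label values are arbitrary per layer; only co-membership matters.
transcriptomics = [0, 0, 0, 1, 1, 1]
proteomics      = [0, 0, 1, 1, 1, 1]
metabolomics    = [1, 1, 1, 0, 0, 0]

matrix = co_association([transcriptomics, proteomics, metabolomics])
print(consensus_clusters(matrix))
```

Patient 2 is placed differently by one layer, but the other two layers outvote it, which is the sense in which several "views" add statistical power.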
18. ASTHMA ENDOTYPES
➤ (Validate your methods)
➤ Use a variety of clustering approaches over asthma cohort ’omics data (Bayesian, spectral, iCluster)
➤ Use multi-omics approaches (SNF, NNMF)
➤ Assess agreement / coherence
➤ Validate in pathways, in other cohorts and in other data types
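One common way to "assess agreement" between two clustering methods is the Rand index: the fraction of patient pairs on which the two methods agree (both put the pair together, or both keep it apart). The sketch below is a plain-Python illustration with hypothetical labels. The slide does not say which agreement measure was used, and the chance-corrected adjusted Rand index is often preferred in practice.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs treated the same way by both clusterings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("clusterings must cover the same samples")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical assignments of 6 patients by two clustering methods.
method_1 = [0, 0, 0, 1, 1, 1]
method_2 = [1, 1, 0, 0, 0, 0]

print(rand_index(method_1, method_2))
print(rand_index(method_1, [5, 5, 5, 9, 9, 9]))  # same partition, renamed labels
```

Because only co-membership is compared, the measure is unaffected by how each method happens to number its clusters.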
19. KNOWLEDGE GRAPHS
➤ Much effort being spent in building them but:
➤ What are they for?
➤ Facts aren’t just facts
➤ “Relationships” need to be rich but loose
➤ Schema-less databases need schema
➤ Graphs may not be the right tool
Meng Wang, 2017
20. KNOWLEDGE GRAPHS NEED CONTEXT
➤ Aim: extract biological relationships from publications to build asthma knowledge base
➤ Domain expert time is prohibitive
➤ Use previous efforts as training
➤ OpenBEL (biological expression language)
➤ Wide range of relationships & entities
➤ Grakn
➤ Allows hyper-relationships & inheritance
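To show what a hyper-relationship with context looks like as a data structure, here is a small sketch using Python dataclasses. It is not Grakn's API or BEL syntax. The class names, role names and the example statement are hypothetical placeholders. The point is that a relation can have more than two role players, can carry its provenance, and can itself play a role in another relation.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:
    name: str
    kind: str  # e.g. "gene", "process", "disease"

@dataclass
class Relation:
    kind: str
    roles: dict                                   # role name -> Entity or Relation
    context: dict = field(default_factory=dict)   # provenance, tissue, species ...

def relations_involving(relations, entity):
    """All relations in which `entity` plays some role directly."""
    return [r for r in relations if entity in r.roles.values()]

# Placeholder entities; not curated biological facts.
gene_a = Entity("GENE_A", "gene")
process_x = Entity("PROCESS_X", "process")
disease_y = Entity("DISEASE_Y", "disease")

# A three-way relation: more than the two endpoints of a plain graph edge.
increases = Relation(
    kind="increases",
    roles={"agent": gene_a, "target": process_x, "setting": disease_y},
    context={"source": "hypothetical publication", "tissue": "airway epithelium"},
)

# A relation about a relation: a second source disputing the first claim.
disputed = Relation(
    kind="contradicted_by",
    roles={"claim": increases},
    context={"source": "another hypothetical publication"},
)

print(len(relations_involving([increases, disputed], gene_a)))
```

Keeping the context on the relation is one way to honour "facts aren't just facts": the same statement can be stored several times with different sources or conditions attached.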
21. CONCLUSIONS
➤ Big biomedical data is often not big, but we can make it bigger
➤ But even big data is not without its problems
➤ Sometimes [Big | Deep | Advanced] approaches are useful, sometimes not: choose wisely
➤ Trust but verify
22. “Success in the pre-clinical arena will come from carefully curated data, melding together disparate data sources & types, careful building of large datasets through consortia & alliances followed by appropriate use of machine learning and validated at the bench or in the clinic.”
24. MLMH2018 - KDD Workshop on Machine Learning for Medicine and Healthcare
August 20, 2018, London, UK
Topics of interest:
• Data Standards for Translational Medicine Informatics
• Analysis of large scale electronic health records or patient-generated health data records
• Visualisation of complex and dynamic biomedical networks
• Disease Subtype Discovery for Precision Medicine
• Interpretable Machine Learning for biomedicine and healthcare
• Deep learning for biomedicine
Important Dates
• Submission deadline: May 25, 2018
• Notification of acceptance: June 8, 2018
• Workshop date: August 8, 2018
Meet our Panel!
T. Roy (Ph.D), University of Southampton, UK
A. Teredesai (PhD), University of Washington, Tacoma
S. Wagers (MD), CEO/Founder BioSci Consulting, Belgium
Join us during the KDD Health Day!
Win IBM $1,000 travel grant for best
selected student paper!
Follow us!
https://mlmhworkshop.github.io/mlmh-2018
Twitter:
Contact us:
mlmhworkshop@googlegroups.com
Organizers:
M. Saqi, Imperial College London, UK
P. Chakraborty, IBM Research, USA
I. Balaur, EISBM, Lyon, France
P. Agapow, Imperial College London, UK
S. Wagers, BioSci Consulting, Belgium
P.Y. S. Hsueh, IBM Research, USA
F. Rahmanian, Geneia, USA
M.A. Ahmad, Kensci Inc. and University of
Washington - Tacoma, USA