Data Science for (Health) Science: tales from a challenging front line, and how to cross a few T's
1. Data Science for (Health) Science:
tales from a challenging front line, and how to cross a few T's
Paolo Missier
School of Computing
Newcastle University, UK
March 2021
A talk given to
The School of Information Sciences
Center for Informatics Research in Science and Scholarship
University of Illinois Urbana-Champaign
paolo.missier@ncl.ac.uk
LinkedIn: paolomissier
Twitter: @PMissier
2. 2
The message:
1. “Data Science” for Health is hard. The hard part is the data
2. “AI for Health” is (Deep) Machine Learning
3. Ethics. Fairness. Trust. Acceptance.
4. Data Provenance for Data Science: Solution or distraction?
• Transparency
• Trustworthiness
• Traceability
4. 4
AI for healthcare – the UK landscape
https://www.turing.ac.uk/research/research-programmes/health-and-medical-sciences
AI and data science will improve the detection, diagnosis, and treatment of
illness. They will optimise the provision of services, and support health service
providers to anticipate demand and deliver improved patient care.
• Explainability / Interpretability
• Exploiting EHR (Electronic Health Records)
• Image interpretation
• Fairness, Bias
• Ethical issues in …
• Predicting <disease / critical event> …
5. 5
Personalised, Predictive, Preventive, Participatory Medicine (P4)
Price ND, Magis AT, Earls JC, et al. A wellness study of 108 individuals using personal, dense, dynamic data clouds.
Nat Biotechnol. 2017;35:747.
6. 6
(*) Data-Driven, Personalised, Predictive, Preventive, Participatory
D2P4 (*)
Healthcare
research
• Cleaning
• Integration
• Alignment
• Imputation
• NLP
• …
Physical Activity monitoring
(wearables)
In-patient hospital records
Primary care health records
+ prescriptions
Clinical protocols
Multi-omics
(genomics, transcriptomics,
proteomics, metabolomics…)
Images -- Histology, X-ray, …
Early detection of Type 2 Diabetes /
Metabolic / age-related diseases
Early detection of Parkinson’s
Frailty / intrinsic capacity assessment
Multi Morbidity Long Term Conditions (MLTC)
Covid risk / Post-Acute Covid Syndrome (PACS)
Liver disease progression: NAFLD / NASH
Liquid biopsy
Programming:
Scripting: python, R, …
Workflows: Knime, RapidMiner..
Methods
Clustering (ML)
Predictive modelling (ML)
Image interpretation / Deep Learning
… “AI”…
(plus traditional statistics!)
7. 7
Big Data for Health Care
Genomics for
personalized medicine
personal monitors /
wearables
Medical Records
Article Source: Big Data: Astronomical or Genomical?
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical?. PLOS Biology 13(7):
e1002195. https://doi.org/10.1371/journal.pbio.1002195
8. 9
D2P4 Accelerometry
9. 10
Digital biomarkers
Digital biomarkers come from "novel sensing systems capable of continuously tracking
behavioral signals […] capture people's everyday routines, actions, and physiological
changes that can explain outcomes related to health, cognitive abilities, and more”
(Choudhury 2018).
Choudhury, Tanzeem. 2018. “Making Sleep Tracking More User Friendly.” Communications of the ACM 61 (11): 156–156.
https://doi.org/10.1145/3266285
- physical activity
- glucose levels
- blood oxygen
levels
- …
Inexpensive scalable personalised self-monitoring
10. 11
A first project: markers from accelerometers?
Initial study Digital biomarkers + UK Biobank Dataset + Type 2 Diabetes outcome
- physical activity
- glucose levels
- blood oxygen levels
- …
Aligned with the P4 agenda
Readily available dataset
(+) 3,500+ features
(+) multi-omics coverage
(+) genomics
(+) links to EHR
(+) Activity monitors made in Newcastle!
(-) Limited follow-ups – little longitudinal data
(-) Population not random
(-) Activity data / person very limited
100K
Activity traces
11. 13
Using wearable activity trackers to predict Type-2 Diabetes
Objective: To determine the extent to which accelerometer traces can be used to distinguish individuals with
Type-2 Diabetes (T2D) from normoglycaemic controls, and to quantify their limitations.
Lam B, Catt M, Cassidy S, Bacardit J, Darke P, Butterfield S, Alshabrawy O, Trenell M, Missier P
Using Wearable Activity Trackers to Predict Type 2 Diabetes: Machine Learning–Based Cross-sectional Study of the UK Biobank Accelerometer
Cohort -- JMIR Diabetes. 20/01/2021:23364 (forthcoming/in press)
Feature
extraction
Clustering
Classification
??
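The feature-extraction step in the pipeline above can be illustrated with a minimal sketch. It assumes a trace is simply a list of acceleration magnitudes sampled at a fixed rate. The feature names and the `active_threshold` cut-off are hypothetical and are not the features used in the study.

```python
from statistics import mean, pstdev

def extract_features(trace, active_threshold=0.1):
    """Summarise one accelerometer trace (a list of acceleration magnitudes).

    `active_threshold` is an illustrative cut-off, not a value from the study.
    """
    if not trace:
        raise ValueError("trace must contain at least one sample")
    active = [x for x in trace if x >= active_threshold]
    return {
        "mean_acceleration": mean(trace),
        "sd_acceleration": pstdev(trace),
        "fraction_active": len(active) / len(trace),
    }

# Hypothetical trace, for illustration only.
features = extract_features([0.02, 0.05, 0.30, 0.45, 0.01, 0.12])
```

Each participant is reduced to one such feature vector, which is what a clustering or classification step would then consume.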
13. 15
Filter:
Accelerometry study?
103,712
Split criteria:
Type 2 Diabetes?
At baseline: 2,755
Through EHR analysis: 1,321
Total: 4,076
Non-Diabetes
99,636
Filter:
EHR data available?
19,852
All UK Biobank participants: 502,664
Filter:
QC on activity traces
3,103
Positives:
T2D vs Norm-0
Physical Impairment analysis
Severe impairment
1,666
No impairment
8,463
A great UG project!
your (biomedical) dataset may not be as big as it looks
T2D vs Norm-1
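The counts in the cohort flow above can be cross-checked with a few lines of arithmetic. This sketch assumes that 3,103 is the number of T2D positives remaining after QC on the activity traces, which is how the figure residue reads but is not stated explicitly.

```python
# Counts copied from the cohort flow above.
all_participants = 502_664
accelerometry_cohort = 103_712
t2d_at_baseline = 2_755
t2d_from_ehr = 1_321

t2d_total = t2d_at_baseline + t2d_from_ehr
non_diabetes = accelerometry_cohort - t2d_total

# Assumption: 3,103 positives remain after QC on the activity traces.
t2d_after_qc = 3_103
share_of_biobank = t2d_after_qc / all_participants
```

The totals reproduce the 4,076 and 99,636 shown on the slide, and the usable positives come to well under 1% of the full cohort, which is the point of "your dataset may not be as big as it looks".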
15. 17
Ongoing work
Are there better embedded representations for accelerometry data?
Can they be used as predictors for other outcomes?
Representation learning
Embedded
feature space
LSTM Autoencoder
Outcome:
Insulin sensitivity
DIRECT
DB
Standard classification
16. 19
D2P4 COVID
17. D. Ferrari (1), Prof. F. Mandreoli (1), Prof. G. Guaraldi (2)
Prof. P. Missier
Predicting respiratory failure in patients with COVID-19
pneumonia: a case study from Northern Italy
Peak of Italian Covid crisis (March 2020 onwards)
Issue: ICU Capacity
Question: will my next patient require ICU resources? How soon?
(1)
(2)
Machine learning in predicting respiratory failure in patients with COVID-19 pneumonia—Challenges, strengths, and opportunities in a global health
emergency
Ferrari D, Milic J, Tonelli R, Ghinelli F, Meschiari M, et al. (2020) Machine learning in predicting respiratory failure in patients with COVID-19 pneumonia—Challenges,
strengths, and opportunities in a global health emergency. PLOS ONE 15(11): e0239172. https://doi.org/10.1371/journal.pone.0239172
18. 21
Study structure
Applied Machine Learning driven by a clinical question
An example of a typical data science pattern:
• Data selection: inclusion and exclusion criteria
• Data preparation / cleaning
• Variable selection
• Model learning: multiple models
• Model evaluation
With additional challenges:
“Live” evolving dataset, with multiple versions of the patient database
• Changes in recording practices
• Inconsistencies
• Lots of missing data
Small data: 198 patients, 1,068 observations, 31-90 variables (symptoms, lab biomarkers)
In the data collection period, the dataset was growing daily with the average of 84 new records per day, with a mean of 10 new data points/patient.
Out of the initial sample of 295 patients and 2,889 data points available, 198 patients contributed to generate 1,068 valuable observations. In detail, 603 observations contributed to the definition of respiratory failure (PaO2/FiO2 < 150 mmHg) and 465 did not meet this definition.
Each data point included a complex record of observations from multiple categories: (1) signs and symptoms, (2) blood biomarkers, (3) respiratory assessment with PaO2/FiO2, (4) history of comorbidities (available in a subset of 119 patients). Some variables were collected daily, and others were recorded upon clinical indications.
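The outcome definition quoted above (PaO2/FiO2 < 150 mmHg) can be written as a small labelling function. This is a sketch only: the function names are hypothetical, FiO2 is assumed to be recorded as a fraction rather than a percentage, and the example observations are invented rather than taken from the study.

```python
RESPIRATORY_FAILURE_THRESHOLD = 150.0  # mmHg, the PaO2/FiO2 cut-off quoted above

def pf_ratio(pao2_mmhg, fio2_fraction):
    """PaO2/FiO2 ratio; FiO2 is a fraction in (0, 1], e.g. 0.21 for room air."""
    if not 0 < fio2_fraction <= 1:
        raise ValueError("FiO2 must be a fraction in (0, 1]")
    return pao2_mmhg / fio2_fraction

def has_respiratory_failure(pao2_mmhg, fio2_fraction):
    """Label one observation as positive when the ratio is below the cut-off."""
    return pf_ratio(pao2_mmhg, fio2_fraction) < RESPIRATORY_FAILURE_THRESHOLD

# Hypothetical observations (PaO2 in mmHg, FiO2 as a fraction); not study data.
observations = [(90.0, 0.21), (60.0, 0.50), (70.0, 0.40)]
labels = [has_respiratory_failure(p, f) for p, f in observations]

# The class counts quoted above add up to the number of usable observations.
total_observations = 603 + 465
```

With 603 positive and 465 negative observations the two classes are reasonably balanced, even though the dataset itself is small.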
20. 24
Modelling Requirements
• Parsimonious: few variables
• Robust to missing data: imputation not an option
• Explainable: trust
  • the model reveals the relative importance of each variable for each prediction it makes
• Minimize the number of false negatives
  • risk of under-estimating the severity of a patient’s condition
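The explainability requirement, a relative importance for each variable in each prediction, is what Shapley values provide. The sketch below computes them exactly for a toy model by enumerating all coalitions of variables. The risk function, variable names and reference patient are hypothetical, and exact enumeration is only feasible for a handful of variables; practical tools rely on approximations or model-specific algorithms.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are replaced by their baseline value.
    Cost is exponential in the number of features, so this only suits toy cases.
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        x = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return predict(x)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi

# Hypothetical risk score and variable names, for illustration only.
def toy_risk(x):
    return 0.02 * x["age"] + 0.5 * x["low_oxygen"] + 0.3 * x["low_oxygen"] * x["fever"]

patient = {"age": 70, "low_oxygen": 1, "fever": 1}
reference = {"age": 50, "low_oxygen": 0, "fever": 0}
phi = shapley_values(toy_risk, patient, reference)
```

The per-variable contributions sum to the difference between the patient's score and the reference score, so a clinician can see how much each variable moved this particular prediction.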
21. 26
Approach
• Parsimonious: feature ranking and selection
• Robust to missing data
• Explainable: Shapley values
• Minimize FN: bespoke loss function
Ensemble of Decision trees
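One common way to bias a classifier against false negatives is to weight errors on positive cases more heavily in the loss. The sketch below shows such a weighted log loss. The slides do not give the form of the bespoke loss actually used, so both the functional form and the `fn_weight` value here are assumptions.

```python
from math import log

def fn_weighted_log_loss(y_true, p_pred, fn_weight=5.0, eps=1e-12):
    """Mean binary cross-entropy where errors on positives cost `fn_weight` times more.

    `fn_weight` is a hypothetical setting; the source does not state the form
    or parameters of the loss that was actually used.
    """
    if len(y_true) != len(p_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if y == 1:
            total += -fn_weight * log(p)  # missed positive: penalised more
        else:
            total += -log(1.0 - p)        # false alarm: standard penalty
    return total / len(y_true)

# Equally confident mistakes in the two directions.
missed_positive = fn_weighted_log_loss([1], [0.2])
false_alarm = fn_weighted_log_loss([0], [0.8])
```

With `fn_weight=5.0`, under-estimating a patient who does deteriorate costs five times as much as an equally confident false alarm, which pushes a model trained on this loss towards higher sensitivity.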
22. 27
Testing multiple models - Results
Parsimony:
Model 1 - suboptimal prediction accuracy
Model 2:
Adding biomarkers including respiratory variables increased performance
Model 3:
boosted mixed model - still requires about 20 variables
From a physician’s perspective, a cluster of 20 variables may be difficult to manage in routine clinical practice.
What our approach offers in support to the decision-making process is a simple interpretation of the predictions.
24. 29
Summary
Good results on “live” data, predicting a useful outcome for the purpose of ICU management
Major selling points:
• Variables (relatively) easy to collect in routine visits and in-hospital
• Models are explainable, medics can reality-check against their own understanding
… Opened the door to further collaborations:
New project on PACS: Post-Acute Covid Syndrome:
Following up recovery paths for 300 patients across 5 hospitals
25. 30
D2P4 EHR analysis for dynamic risk prediction
Survival analysis
Longitudinal prediction models
27. 32
Clinical Risk Prediction Models
Healthy participant or
missing data/under-
reported conditions?
Number/pattern of
records is a proxy
for health?
Informed presence bias
Individuals in EHR data are systematically different to those who are not (Goldstein et al, 2016)
28. 36
Case study: Type 2 Diabetes
[Figure 17: Example output of the phenotyping tool, for one participant (identifier redacted). Panels: glycated hemoglobin HbA1c (mmol/mol), with bands for pre-diabetes, diabetes and remission; fasting plasma glucose (mmol/l), with bands for normoglycaemic, pre-diabetic and diabetic; values taken from primary care records and UKBB visits. Timeline of electronic health records, 1987 to 2015: primary and secondary care events (Obs, Drug, Diag, Op), the estimated observation period, and diabetes records.]
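The idea behind longitudinal phenotyping can be sketched as mapping each HbA1c measurement to a glycaemic state and keeping the dates at which the state changes. The cut-offs below (42 and 48 mmol/mol) are commonly used diagnostic thresholds and are assumed here for illustration. The actual tool also draws on diagnoses and prescriptions, and remission in particular cannot be inferred from a single HbA1c value.

```python
# Commonly used HbA1c cut-offs (mmol/mol); assumed here for illustration and
# not necessarily the rules applied by the phenotyping tool.
PREDIABETES_MIN = 42
DIABETES_MIN = 48

def glycaemic_state(hba1c_mmol_mol):
    """Map one HbA1c value to a coarse glycaemic state."""
    if hba1c_mmol_mol >= DIABETES_MIN:
        return "diabetes"
    if hba1c_mmol_mol >= PREDIABETES_MIN:
        return "pre-diabetes"
    return "normoglycaemic"

def state_changes(measurements):
    """Return (date, state) pairs at each change of state, in date order."""
    changes = []
    for date, value in sorted(measurements):
        state = glycaemic_state(value)
        if not changes or changes[-1][1] != state:
            changes.append((date, state))
    return changes

# Hypothetical measurements: (ISO date, HbA1c in mmol/mol).
history = [("2008-03-01", 44), ("2010-06-15", 52),
           ("2009-01-10", 45), ("2013-02-20", 40)]
timeline = state_changes(history)
```

ISO-formatted dates sort correctly as strings, so records arriving out of order from different sources are handled by the initial sort.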
29. 37
Case study: Type 2 Diabetes – remission study
Type 2 diabetes remission
Longitudinal phenotyping with large–scale observational data
Philip Darke
EPSRC Centre for Doctoral
Training in Cloud Computing for
Big Data Newcastle University
UK Biobank (ukbiobank.ac.uk) is a UK-based prospective study into illness in middle and old age with over 500,000 participants. Diabetes is one of the most prevalent conditions in the cohort with nearly 70,000 diagnoses [2] expected by 2027. Study data is collected at participant visits and via linkage to national datasets including EHR data. These data have been used to longitudinally phenotype over 200,000 participants for diabetes as illustrated in figure 1. The approach will be expanded to all participants when further data is released.
[2] Naomi Allen, et al. UK Biobank: Current status and what it means for epidemiology. Health Policy and Technology, 1(3):123–126, September 2012. doi: 10.1016/j.hlpt.2012.07.003
[Figure 1: Model output showing HbA1c, weight, periods of medication and inferred diabetic status for an example participant. Long-term remission was achieved by sustained weight loss post diagnosis. Panels: HbA1c (mmol/mol), with pre-diabetes, type 2 diabetes and remission periods; weight (kg); biguanide prescriptions; time axis 2004 to 2016.]
Many of those diagnosed with type 2 diabetes experience a subsequent period of remission. Some relapse whilst others achieve long-term remission and cease anti-diabetes medication. This project will examine the pathways to remission at scale using ob-
30. 38
D2P4 MLTC-M
NLP
31. 39
Multimorbidity and Long-Term Conditions
Patients with multimorbidities have the greatest healthcare needs and generate the
highest expenditure in the health system.
There is an increasing focus on identifying specific disease combinations for
addressing poor outcomes.
Matrix factorization / factor analysis
Clustering
Multiple correspondence analysis
Network analysis
…
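A common first step behind the clustering and network-analysis methods listed above is to count how often pairs of conditions occur in the same patient. The sketch below does this with the standard library. The patient identifiers and condition names are invented, and real studies would normally adjust these raw counts for the prevalence of each condition.

```python
from collections import Counter
from itertools import combinations

def condition_pair_counts(patients):
    """Count how many patients have each unordered pair of conditions."""
    counts = Counter()
    for conditions in patients.values():
        for pair in combinations(sorted(set(conditions)), 2):
            counts[pair] += 1
    return counts

# Hypothetical patients and conditions, for illustration only.
patients = {
    "p1": ["diabetes", "hypertension", "ckd"],
    "p2": ["diabetes", "hypertension"],
    "p3": ["hypertension", "copd"],
}
pairs = condition_pair_counts(patients)
most_common_pair, n_patients = pairs.most_common(1)[0]
```

The resulting counts can be read as edge weights in a disease co-occurrence network, with conditions as nodes.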
Which data?
Fragmented / disconnected data sources
Data access
Data governance
32. 40
D2P4 NAFLD / non-alcoholic fatty liver disease
33. 41
D2P4 NAFLD / NASH
NASH = non-alcoholic steatohepatitis
Aims:
- Integrate cross-sectional and longitudinal clinical outcome data with a multi-dimensional ‘omics’ record
- Hypothesis: a precision medicine approach leads to better understanding of individuals’ trajectories
- Personalised biomarkers: liquid biopsy
Dataset: European NAFLD Registry
7,750 patients with histologically proven NAFLD/NASH
- Omics (cross-sectional)
- Longitudinal follow ups
Methods:
- Precision: clustering
- Anticipating progression: Learn cluster-specific longitudinal models
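The two-step method above (cluster first, then learn a longitudinal model per cluster) can be sketched as follows. Cluster labels are assumed to be given already, and a straight-line trend stands in for whatever longitudinal model the project will actually use. All names and values are hypothetical and are not registry data.

```python
from collections import defaultdict

def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * t."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct time points")
    slope = sum((t - mean_t) * (y - mean_y) for t, y in points) / sxx
    return mean_y - slope * mean_t, slope

def cluster_trends(follow_ups, cluster_of):
    """Fit one linear trend per cluster from (patient, years, marker) rows."""
    by_cluster = defaultdict(list)
    for patient, years, marker in follow_ups:
        by_cluster[cluster_of[patient]].append((years, marker))
    return {c: fit_line(pts) for c, pts in by_cluster.items()}

# Hypothetical cluster labels and follow-up values; not registry data.
cluster_of = {"a": "stable", "b": "stable", "c": "progressing"}
follow_ups = [("a", 0, 1.0), ("a", 2, 1.0), ("b", 1, 1.0),
              ("c", 0, 1.0), ("c", 1, 1.5), ("c", 2, 2.0)]
trends = cluster_trends(follow_ups, cluster_of)
```

Each cluster gets its own (intercept, slope) pair, so a new patient assigned to a cluster inherits that cluster's expected trajectory.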
34. 42
DP4DS: Data Provenance for Data Science
D2P4
+
DP4DS(*)
35. 43
Data Model Predictions
Model
pre-processing
Raw
datasets
features
Predicted you:
- Ranking
- Score
- Class
Data
collection
Instances
Key decisions are made during data selection and
processing:
- Where does the data come from?
- What’s in the dataset?
- What transformations were applied?
Complementing current ML approaches to model interpretability
1. Can we explain these decisions?
2. Are these explanations useful?
36. 44
Explaining data preparation
Data
collection
Model
Population data pre-processing
Raw
datasets
features
Predicted you:
- Ranking
- Score
- Class
- Integration
- Cleaning
- Outlier removal
- Normalisation
- Feature selection
- Class rebalancing
- Sampling
- Stratification
- …
Data acquisition and wrangling:
- How were datasets acquired?
- How recently?
- For what purpose?
- Are they being reused /
repurposed?
- What is their quality?
Instances
- Scripts: Python / TensorFlow, Pandas, Spark
- Workflows: Knime, …
Provenance Transparency
37. 46
Recent early results
A small grassroots project… [1]
- Formalisation of provenance patterns for pipeline operators
- Systematic collection of fine-grained provenance from (nearly) arbitrary pipelines
- Reality check:
- How much does it cost? (provenance volume)
- Does it help? (queries against the provenance database)
[1]. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier,
P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507-520, January, 2021.
38. 47
Operators
op
Data reduction
- Feature selection
- Instance selection
Data augmentation
- Space transformation
- Instance generation
- Encoding (eg one-hot…)
Data transformation
- Data repair
- Binarisation
- Normalisation
- Discretisation
- Imputation
Ex.: vertical augmentation (adding columns)
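To make the vertical augmentation example concrete, the sketch below adds a derived column to a small table and records, for every new cell, which input cells it was computed from. The record layout (`derived`, `used`, `operator`) and the function names are hypothetical. This only illustrates the idea of fine-grained provenance and is not the capture mechanism described in the paper.

```python
def augment_with_provenance(rows, new_column, func, input_columns):
    """Add `new_column` to every row and record which cells each value came from.

    Returns (new_rows, provenance). Each provenance entry links one derived
    cell to the input cells used to compute it.
    """
    new_rows, provenance = [], []
    for index, row in enumerate(rows):
        value = func(*(row[c] for c in input_columns))
        new_rows.append({**row, new_column: value})
        provenance.append({
            "derived": (index, new_column),
            "used": [(index, c) for c in input_columns],
            "operator": func.__name__,
        })
    return new_rows, provenance

def bmi(weight_kg, height_m):
    return weight_kg / height_m ** 2

# Hypothetical records, for illustration only.
rows = [{"weight_kg": 81.0, "height_m": 1.8}, {"weight_kg": 50.0, "height_m": 1.6}]
augmented, prov = augment_with_provenance(rows, "bmi", bmi, ["weight_kg", "height_m"])
```

One provenance entry is produced per derived cell, which hints at why provenance volume is a practical concern for large datasets.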
39. 48
Code instrumentation
Create a provlet for
a specific
transformation
Initialize provenance
capture
…code injection is now being automated!
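The instrumentation pattern on this slide (initialise capture, then record each transformation) can be approximated with a decorator. The sketch below is a coarse-grained stand-in: `ProvenanceLog` and its `track` method are hypothetical names, and it records only operator names and row counts rather than the cell-level provlets of the actual system.

```python
import functools

class ProvenanceLog:
    """Minimal in-memory log; a stand-in for a real provenance store."""

    def __init__(self):
        self.records = []

    def track(self, func):
        """Decorator that records one entry per call to a pipeline operator."""
        @functools.wraps(func)
        def wrapper(rows, *args, **kwargs):
            result = func(rows, *args, **kwargs)
            self.records.append({
                "operator": func.__name__,
                "rows_in": len(rows),
                "rows_out": len(result),
            })
            return result
        return wrapper

log = ProvenanceLog()  # initialise provenance capture

@log.track
def drop_missing(rows, column):
    """Instance selection: keep rows where `column` is present."""
    return [r for r in rows if r.get(column) is not None]

@log.track
def min_max_scale(rows, column):
    """Normalisation: rescale `column` to [0, 1]."""
    values = [r[column] for r in rows]
    low, high = min(values), max(values)
    span = (high - low) or 1.0
    return [{**r, column: (r[column] - low) / span} for r in rows]

# Hypothetical data, for illustration only.
data = [{"crp": 10.0}, {"crp": None}, {"crp": 30.0}]
clean = min_max_scale(drop_missing(data, "crp"), "crp")
```

After the pipeline runs, `log.records` lists the operators in execution order together with how many rows each one received and returned, which is enough to answer simple questions such as where rows were dropped.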
47. 56
Summary
Multiple hypotheses regarding Data Provenance for Data Science:
1. Is it practical to collect fine-grained provenance?
1. To what extent can it be done automatically?
2. How much does it cost?
2. Is it also useful? What is the benefit to data analysts?
Work in progress! Interest? Ideas?
48. 57
Acknowledgments
Prof. Mike Catt
PhD Students: Ben Lam, Philip Darke
MSc student: Sam Butterfield
Prof. Guaraldi
Prof. Mandreoli
MSc student: Davide Ferrari
Prof. Torlone
MSc student: Giulia Simonelli
Prof. Chapman
Editor's notes
CVD was the leading cause of death for males (15.5%) and the second for females (8.8%) in 2015 (*)
How about the data used to train / build the model?
Relatively easy to keep track of data pre-processing provenance
\newcommand{\f}{\textbf{a}}
\text{features}~ X=[\f_1 \ldots \f_k]
\text{new features}~ Y=[\f'_1 \ldots \f'_l]
\noindent new values for each row are obtained by applying $f$\\ to values in the $X$ features