Data Science for (Health) Science:
tales from a challenging front line, and how to cross a few T's
Paolo Missier
School of Computing
Newcastle University, UK
March 2021
A talk given to
The School of Information Sciences
Center for Informatics Research in Science and Scholarship
University of Illinois Urbana-Champaign
paolo.missier@ncl.ac.uk
LinkedIn: paolomissier
Twitter: @PMissier
2
The message:
1. “Data Science” for Health is hard. The hard part is the data
2. “AI for Health” is (Deep) Machine Learning
3. Ethics. Fairness. Trust. Acceptance.
4. Data Provenance for Data Science: Solution or distraction?
• Transparency
• Trustworthiness
• Traceability
3
A Grand Challenge
https://epsrc.ukri.org/research/ourportfolio/themes/healthcaretechnologies/strategy/grandchallenges/
4
AI for healthcare – the UK landscape
https://www.turing.ac.uk/research/research-programmes/health-and-medical-sciences
AI and data science will improve the detection, diagnosis, and treatment of
illness. They will optimise the provision of services, and support health service
providers to anticipate demand and deliver improved patient care.
• Explainability / Interpretability
• Exploiting EHR (Electronic Health Records)
• Image interpretation
• Fairness, Bias
• Ethical issues in …
• Predicting <disease / critical event> …
5
Personalised, Predictive, Preventive, Participatory Medicine (P4)
Price ND, Magis AT, Earls JC, et al. A wellness study of 108 individuals using personal, dense, dynamic data clouds.
Nat Biotechnol. 2017;35:747.
6
(*) Data-Driven, Personalised, Predictive, Preventive, Participatory
D2P4 (*)
Healthcare
research
• Cleaning
• Integration
• Alignment
• Imputation
• NLP
• …
Physical Activity monitoring
(wearables)
In-patient hospital records
Primary care health records
+ prescriptions
Clinical protocols
Multi-omics
(genomics, transcriptomics,
proteomics, metabolomics…)
Images -- Histology, X-ray, …
Early detection of Type 2 Diabetes /
Metabolic / age-related diseases
Early detection of Parkinson’s
Frailty / intrinsic capacity assessment
Multi Morbidity Long Term Conditions (MLTC)
Covid risk / Post-Acute Covid Syndrome (PACS)
Liver disease progression: NAFLD / NASH
Liquid biopsy
Programming:
Scripting: python, R, …
Workflows: Knime, RapidMiner..
Methods
Clustering (ML)
Predictive modelling (ML)
Image interpretation / Deep Learning
… “AI”…
(plus traditional statistics!)
7
Big Data for Health Care
Genomics for
personalized medicine
personal monitors /
wearables
Medical Records
Article Source: Big Data: Astronomical or Genomical?
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical?. PLOS Biology 13(7):
e1002195. https://doi.org/10.1371/journal.pbio.1002195
9
D2P4 → Accelerometry
10
Digital biomarkers
Digital biomarkers come from “novel sensing systems capable of continuously tracking behavioral signals […] capture people's everyday routines, actions, and physiological changes that can explain outcomes related to health, cognitive abilities, and more” (Choudhury 2018).
Choudhury, Tanzeem. 2018. “Making Sleep Tracking More User Friendly.” Communications of the ACM 61 (11): 156–156. https://doi.org/10.1145/3266285
- physical activity
- glucose levels
- blood oxygen levels
- …
Inexpensive → scalable personalised self-monitoring
11
A first project: markers from accelerometers?
Initial study: digital biomarkers + UK Biobank dataset + Type 2 Diabetes outcome
- physical activity
- glucose levels
- blood oxygen levels
- …
Aligned with the P4 agenda
Readily available dataset
(+) 3,500+ features
(+) multi-omics coverage
(+) genomics
(+) links to EHR
(+) Activity monitors made in Newcastle!
(-) Limited follow-ups – little longitudinal data
(-) Population not random
(-) Activity data / person very limited
100K
Activity traces
13
Using wearable activity trackers to predict Type-2 Diabetes
Objective: To determine the extent to which accelerometer traces can be used to distinguish individuals with
Type-2 Diabetes (T2D) from normoglycaemic controls, and to quantify their limitations.
Lam B, Catt M, Cassidy S, Bacardit J, Darke P, Butterfield S, Alshabrawy O, Trenell M, Missier P
Using Wearable Activity Trackers to Predict Type 2 Diabetes: Machine Learning–Based Cross-sectional Study of the UK Biobank Accelerometer
Cohort -- JMIR Diabetes. 20/01/2021:23364 (forthcoming/in press)
Feature
extraction
Clustering
Classification
??
14
Granular activity representation
feature extraction 60 features / day
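A minimal sketch of what per-day feature extraction from an accelerometer trace could look like. The three features, the 0.1 g activity threshold and the function names are illustrative assumptions, not the 60 features used in the study.

```python
from statistics import mean, pstdev


def daily_features(samples, active_threshold=0.1):
    """Summarise one day of acceleration magnitudes (in g) as a small feature dict.

    The features and the threshold are hypothetical placeholders.
    """
    if not samples:
        raise ValueError("need at least one sample")
    active = sum(1 for s in samples if s >= active_threshold)
    return {
        "mean": mean(samples),
        "std": pstdev(samples),
        "active_fraction": active / len(samples),
    }


def features_per_day(trace_by_day):
    """Map {day: [samples]} to {day: feature dict}."""
    return {day: daily_features(s) for day, s in trace_by_day.items()}


# Toy trace, made up for illustration only.
toy_trace = {"day1": [0.0, 0.2, 0.4, 0.0], "day2": [0.05, 0.05]}
features = features_per_day(toy_trace)
```

Each day becomes one fixed-length row, which is the shape that clustering and classification steps expect.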
15
Filter:
Accelerometry study?
103,712
Split criteria:
Type 2 Diabetes?
At baseline: 2,755
Through EHR analysis: 1,321
Total: 4,076
Non-Diabetes
99,636
Filter:
EHR data available?
19,852
All UK Biobank participants:
502,664
Filter:
QC on activity traces
3,103
Positives:
T2D vs Norm-0
Physical Impairment analysis
Severe impairment
1,666
No impairment
8,463
A great UG project!
your (biomedical) dataset may not be as big as it looks
T2D vs Norm-1
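The counts in the flowchart can be checked with a few lines of arithmetic. This is only a sanity check on the numbers shown on the slide. Reading 3,103 as the positives left after QC is an interpretation of the flattened flowchart, not something the slide states outright.

```python
# Counts copied from the slide.
all_participants = 502_664
accelerometry_cohort = 103_712
t2d_at_baseline = 2_755
t2d_from_ehr = 1_321

# Split of the accelerometry cohort into T2D and non-diabetes.
t2d_total = t2d_at_baseline + t2d_from_ehr
non_diabetes = accelerometry_cohort - t2d_total

# Assumption: 3,103 is the number of positives remaining after QC.
positives_after_qc = 3_103
positives_share = positives_after_qc / all_participants

print(t2d_total, non_diabetes, f"{positives_share:.2%}")
```

The split totals agree with the slide (4,076 and 99,636), and the usable positives are well under 1% of the full cohort, which is the point of "your dataset may not be as big as it looks".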
16
(some) results
Negatives:   HLAF            SDL             HLAF+SDL
             Norm-0  Norm-2  Norm-0  Norm-2  Norm-0  Norm-2
RF           .80     .68     .83     .78     .86     .77
LR           .79     .70     .83     .78     .86     .78
XGB          .78     .66     .80     .74     .85     .75
17
Ongoing work
Are there better embedded representations for accelerometry data?
Can they be used as predictors for other outcomes?
Representation learning
Embedded
feature space
LSTM Autoencoder
Outcome:
Insulin sensitivity
DIRECT
DB
Standard classification
19
D2P4 → COVID
D. Ferrari1, Prof. F. Mandreoli1, Prof. G. Guaraldi2
Prof. P. Missier
Predicting respiratory failure in patients with COVID-19
pneumonia: a case study from Northern Italy
Peak of Italian Covid crisis (March 2020 onwards)
Issue: ICU Capacity
Question: will my next patient require ICU resources? How soon?
(1)
(2)
Ferrari D, Milic J, Tonelli R, Ghinelli F, Meschiari M, et al. (2020) Machine learning in predicting respiratory failure in patients with COVID-19 pneumonia—Challenges,
strengths, and opportunities in a global health emergency. PLOS ONE 15(11): e0239172. https://doi.org/10.1371/journal.pone.0239172
21
Study structure
Applied Machine Learning driven by a clinical question
An example of typical data science pattern:
• Data selection → inclusion, exclusion criteria
• Data preparation / cleaning
• Variable selection
• Model learning → multiple models
• Model evaluation
With additional challenges:
“Live” evolving dataset with multiple versions of a patient database
• Changes in recording practices
• Inconsistencies
• Lots of missing data
Small data: 198 patients → 1068 observations → 31-90 variables (symptoms, lab biomarkers)
In the data collection period, the dataset was growing daily with the average of 84 new records per day, with a mean of 10 new data points/patient.
Out of the initial sample of 295 patients and 2,889 data points available, 198 patients contributed to generate 1068 valuable observations. In detail, 603 observations contributed to the definition of respiratory failure (PaO2/FiO2 < 150 mmHg) and 465 did not meet this definition.
Each data point included a complex record of observations from multiple categories: (1) signs and symptoms, (2) blood biomarkers, (3) respiratory assessment with PaO2/FiO2, (4) history of comorbidities (available in a subset of 119 patients). Some variables were collected daily, and others were recorded upon clinical indications.
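The outcome definition quoted above (PaO2/FiO2 < 150 mmHg) translates directly into a labelling rule. A minimal sketch follows. The function names and the toy observations are assumptions for illustration, and FiO2 is assumed to be recorded as a fraction (0.21 for room air).

```python
RESPIRATORY_FAILURE_THRESHOLD = 150.0  # mmHg, from the definition quoted above


def pf_ratio(pao2_mmhg, fio2_fraction):
    """PaO2/FiO2 ratio, with FiO2 given as a fraction in (0, 1]."""
    if not 0.0 < fio2_fraction <= 1.0:
        raise ValueError("FiO2 must be a fraction in (0, 1]")
    return pao2_mmhg / fio2_fraction


def is_respiratory_failure(pao2_mmhg, fio2_fraction):
    """True when the observation meets the respiratory failure definition."""
    return pf_ratio(pao2_mmhg, fio2_fraction) < RESPIRATORY_FAILURE_THRESHOLD


# Toy (PaO2, FiO2) pairs, made up for illustration only.
observations = [(60.0, 0.5), (90.0, 0.21), (75.0, 0.5)]
labels = [is_respiratory_failure(p, f) for p, f in observations]

# The two outcome groups reported above add up to the 1068 observations.
assert 603 + 465 == 1068
```

The third toy observation sits exactly on the threshold (150) and is therefore not labelled as failure, since the definition uses a strict inequality.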
22
A case study to illustrate the problem
24
Modelling Requirements
• Parsimonious → few variables
• Robust to missing data → imputation not an option
• Explainable → Trust
• model reveals the relative importance of each variable for each prediction it makes
• Minimize the number of false negatives
• risk of under-estimating the severity of a patient’s condition
26
Approach
• Parsimonious → feature ranking and selection
• Robust to missing data
• Explainable → Shapley values
• Minimize FN → bespoke loss function
Ensemble of Decision trees
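The slide does not say what the bespoke loss function is. One common way to push a classifier away from false negatives is an asymmetric log loss that penalises missed positives more heavily. A sketch under that assumption, where the weight of 5 is arbitrary:

```python
import math


def weighted_log_loss(y_true, p_pred, fn_weight=5.0, eps=1e-12):
    """Mean log loss where errors on positive cases cost fn_weight times more.

    fn_weight is a hypothetical knob. With fn_weight=1 this is ordinary log loss.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if y == 1:
            total += -fn_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)


# A confident miss on a positive case versus an equally confident false alarm.
missed_positive = weighted_log_loss([1], [0.1])
false_alarm = weighted_log_loss([0], [0.9])
```

With this weighting, under-estimating a severe patient costs five times as much as an equally confident false alarm, which matches the stated requirement to minimise false negatives.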
27
Testing multiple models - Results
Parsimony:
Model 1 - suboptimal prediction accuracy
Model 2:
Adding biomarkers including respiratory variables increased performance
Model 3:
boosted mixed model - still requires about 20 variables
From a physician’s perspective, a cluster of 20 variables may be difficult to manage in routine clinical practice.
What our approach offers in support of the decision-making process is a simple interpretation of the predictions.
28
Which are the most important predictors?
SHAP values
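To make the idea behind Shapley-based importance concrete, here is an exact computation for a tiny made-up scoring function. It averages each variable's marginal contribution over all orderings. The variable names and the scoring function are hypothetical and unrelated to the study's model. Practical tools approximate this for real models, because exact enumeration is exponential in the number of variables.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            after = value_fn(frozenset(present))
            totals[f] += after - before
    return {f: t / len(orderings) for f, t in totals.items()}


def toy_score(present):
    """Hypothetical score: two additive effects plus one interaction."""
    score = 0.0
    if "var_a" in present:
        score += 0.3
    if "var_b" in present:
        score += 0.1
    if "var_a" in present and "var_b" in present:
        score += 0.2  # interaction, shared equally between var_a and var_b
    return score


phi = shapley_values(toy_score, ["var_a", "var_b", "var_c"])
```

The values sum to the difference between the full score and the empty score, and a variable that never changes the score (var_c here) gets zero.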
29
Summary
Good results on “live” data, predicting a useful outcome for the purpose of ICU management
Major selling points:
• Variables (relatively) easy to collect in routine visits and in-hospital
• Models are explainable, medics can reality-check against their own understanding
… Opened the door to further collaborations:
New project on PACS: Post-Acute Covid Syndrome:
Following up recovery paths for 300 patients across 5 hospitals
30
D2P4 → EHR analysis for dynamic risk prediction
Methods
Clustering (ML)
Predictive modelling (ML)
Survival analysis
Longitudinal prediction models
31
Longitudinal data: Health-related events
https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41270
UK Biobank - Primary Care Linked Data
32
Clinical Risk Prediction Models
Healthy participant or missing data/under-reported conditions?
Number/pattern of records is a proxy for health?
Informed presence bias
Individuals in EHR data are systematically different to those who are not (Goldstein et al, 2016)
36
Case study: Type 2 Diabetes
[Figure 17: Example output of the phenotyping tool. For a redacted participant, the panels show glycated hemoglobin HbA1c (mmol/mol) and fasting plasma glucose (mmol/l) from primary care records and the UKBB visit, with pre-diabetes, diabetes and remission bands (normoglycaemic, pre-diabetic, diabetic), plus primary and secondary care events (Obs, Drug, Diag, Op) over the estimated observation period, 1987 to 2015. Legend: electronic health records, diabetes record.]
37
Case study: Type 2 Diabetes – remission study
Type 2 diabetes remission
Longitudinal phenotyping with large–scale observational data
Philip Darke
EPSRC Centre for Doctoral Training in Cloud Computing for Big Data, Newcastle University
UK Biobank (ukbiobank.ac.uk) is a UK–based prospective study into illness in middle and old age with over 500,000 participants. Diabetes is one of the most prevalent conditions in the cohort with nearly 70,000 diagnoses [2] expected by 2027. Study data is collected at participant visits and via linkage to national datasets including EHR data. These data have been used to longitudinally phenotype over 200,000 participants for diabetes as illustrated in figure 1. The approach will be expanded to all participants when further data is released.
[2] Naomi Allen, et al. UK Biobank: Current status and what it means for epidemiology. Health Policy and Technology, 1(3):123–126, September 2012. doi:10.1016/j.hlpt.2012.07.003
Figure 1: Model output showing HbA1c, weight, periods of medication and inferred diabetic status for an example participant. Long–term remission was achieved by sustained weight loss post diagnosis.
Many of those diagnosed with type 2 diabetes experience a subsequent period of remission. Some relapse whilst others achieve long–term remission and cease anti–diabetes medication. This project will examine the pathways to remission at scale using ob-
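The inferred diabetic status described in the Figure 1 caption can be pictured as a rule over HbA1c readings and medication periods. The sketch below is an assumption-laden illustration, not the project's phenotyping model. It uses the commonly cited HbA1c cut-offs of 42 and 48 mmol/mol and a deliberately naive remission rule, and the history is made up.

```python
PREDIABETES_MIN = 42  # mmol/mol, assumed cut-off
DIABETES_MIN = 48     # mmol/mol, assumed cut-off


def classify_reading(hba1c):
    """Label a single HbA1c reading (mmol/mol)."""
    if hba1c >= DIABETES_MIN:
        return "diabetes"
    if hba1c >= PREDIABETES_MIN:
        return "pre-diabetes"
    return "normoglycaemic"


def infer_status(readings):
    """readings: (year, hba1c, on_medication) tuples sorted by year."""
    statuses = []
    diagnosed = False
    for year, hba1c, on_medication in readings:
        status = classify_reading(hba1c)
        if status == "diabetes":
            diagnosed = True
        elif diagnosed:
            # Below the diabetes cut-off after a diagnosis: treat as remission
            # only when no anti-diabetes medication is recorded (naive rule).
            status = "diabetes" if on_medication else "remission"
        statuses.append((year, status))
    return statuses


# Hypothetical participant history.
history = [(2004, 44, False), (2006, 55, False), (2008, 50, True),
           (2011, 45, True), (2014, 40, False)]
timeline = infer_status(history)
```

A real phenotyping tool has to cope with sparse and irregular readings, which is where the missing-data and informed-presence issues raised earlier come in.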
38
D2P4 → MLTC-M
Methods
Clustering (ML)
Predictive modelling (ML)
NLP
39
Multimorbidity and Long-Term Conditions
Patients with multimorbidities have the greatest healthcare needs and generate the
highest expenditure in the health system.
There is an increasing focus on identifying specific disease combinations for
addressing poor outcomes.
Matrix factorization / factor analysis
Clustering
Multiple correspondence analysis
Network analysis
…
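Most of the methods listed above start from the same raw material: which conditions co-occur in which patients. A minimal sketch of building pairwise co-occurrence counts, usable as edge weights for network analysis or as input to clustering. The patient records and condition names are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(patient_conditions):
    """Count how many patients have each unordered pair of conditions."""
    counts = Counter()
    for conditions in patient_conditions:
        # Sort so each pair has one canonical (a, b) form.
        for pair in combinations(sorted(set(conditions)), 2):
            counts[pair] += 1
    return counts


# Hypothetical patients, each with a set of long-term conditions.
patients = [
    {"T2D", "hypertension"},
    {"T2D", "hypertension", "CKD"},
    {"hypertension"},
    {"CKD", "T2D"},
]
pairs = cooccurrence_counts(patients)
```

Patients with a single condition contribute no pairs, so they only matter when the counts are later normalised by prevalence.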
Which data?
Fragmented / disconnected data sources
→ Data access
→ Data governance
40
D2P4 → NAFLD / non-alcoholic fatty liver disease
41
D2P4 → NAFLD / NASH
NASH = non-alcoholic steatohepatitis
Aims:
- Integrate cross-sectional and longitudinal clinical outcomes data with a multi-dimensional ‘omics’ record
- Hypothesis: a precision medicine approach leads to better understanding of individuals’ trajectories
- Personalised biomarkers → liquid biopsy
Dataset: European NAFLD Registry
7,750 patients with histologically proven NAFLD/NASH
- Omics (cross-sectional)
- Longitudinal follow ups
Methods:
- Precision: clustering
- Anticipating progression: Learn cluster-specific longitudinal models
42
DP4DS: Data Provenance for Data Science
D2P4
+
DP4DS(*)
43
Data → Model → Predictions
Model
pre-processing
Raw
datasets
features
Predicted you:
- Ranking
- Score
- Class
Data
collection
Instances
Key decisions are made during data selection and
processing:
- Where does the data come from?
- What’s in the dataset?
- What transformations were applied?
Complementing current ML approaches to model interpretability
1. Can we explain these decisions?
2. Are these explanations useful?
44
Explaining data preparation
Data
collection
Model
Population data pre-processing
Raw
datasets
features
Predicted you:
- Ranking
- Score
- Class
- Integration
- Cleaning
- Outlier removal
- Normalisation
- Feature selection
- Class rebalancing
- Sampling
- Stratification
- …
Data acquisition and wrangling:
- How were datasets acquired?
- How recently?
- For what purpose?
- Are they being reused /
repurposed?
- What is their quality?
Instances
- Scripts → Python / TensorFlow, Pandas, Spark
- Workflows → Knime, …
Provenance → Transparency
46
Recent early results
A small grassroots project… [1]
- Formalisation of provenance patterns for pipeline operators
- Systematic collection of fine-grained provenance from (nearly) arbitrary pipelines
- Reality check:
- How much does it cost? → provenance volume
- Does it help? → queries against the provenance database
[1]. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier, P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507-520, January, 2021.
47
Operators
op
Data reduction
- Feature selection
- Instance selection
Data augmentation
- Space transformation
- Instance generation
- Encoding (eg one-hot…)
Data transformation
- Data repair
- Binarisation
- Normalisation
- Discretisation
- Imputation
Ex.: vertical augmentation → adding columns
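As an illustration of vertical augmentation with cell-level provenance, the sketch below one-hot encodes a column and records, for every new cell, which original cell it was derived from. The record layout and the function name are assumptions made for this example and are not the data model of the cited paper.

```python
def one_hot_with_provenance(rows, column):
    """One-hot encode `column`, returning (new_rows, provenance records).

    Each provenance record links a derived cell (row index, new column)
    to the source cell (row index, original column).
    """
    categories = sorted({row[column] for row in rows})
    new_rows, provenance = [], []
    for index, row in enumerate(rows):
        new_row = dict(row)
        for category in categories:
            new_col = f"{column}={category}"
            new_row[new_col] = 1 if row[column] == category else 0
            provenance.append({
                "derived": (index, new_col),
                "used": (index, column),
                "operator": "one_hot",
            })
        new_rows.append(new_row)
    return new_rows, provenance


# Toy table, made up for illustration only.
rows = [{"id": 1, "sex": "F"}, {"id": 2, "sex": "M"}]
encoded, prov = one_hot_with_provenance(rows, "sex")
```

Even this toy case produces one record per derived cell (rows times new columns), which hints at why provenance volume is one of the costs to evaluate.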
48
Code instrumentation
Create a provlet for
a specific
transformation
Initialize provenance
capture
…code injection is now being automated!
49
Provenance patterns
50
Provenance templates
Template + binding rules = instantiated provenance fragment
op
{old values: F, I, V} → {new values: F’, J, V’} +
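The "template + binding rules" idea can be sketched with plain string templates: a fixed pattern of provenance statements with placeholders, filled in once per transformed value. The statement names follow W3C PROV relations (used, wasGeneratedBy, wasDerivedFrom). The template contents, the identifiers and the helper name are assumptions for illustration, not the templates used in the work presented here.

```python
from string import Template

# Hypothetical template for an operator that replaces an old value with a new one.
VALUE_UPDATE_TEMPLATE = [
    "entity($old_value)",
    "entity($new_value)",
    "activity($op)",
    "used($op, $old_value)",
    "wasGeneratedBy($new_value, $op)",
    "wasDerivedFrom($new_value, $old_value)",
]


def instantiate(template, bindings):
    """Fill every placeholder. Raises KeyError if a binding is missing."""
    return [Template(line).substitute(bindings) for line in template]


# Bindings for one cell touched by one operator invocation (made-up identifiers).
fragment = instantiate(
    VALUE_UPDATE_TEMPLATE,
    {"op": "normalise_1", "old_value": "cell_3_age", "new_value": "cell_3_age_v2"},
)
```

Because the pattern is fixed per operator, only the bindings need to be captured at run time.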
51
This applies to all operators…
52
Putting it all together
53
Evaluation - performance
54
Evaluation: Provenance capture and query times
55
Scalability
56
Summary
Multiple hypotheses regarding Data Provenance for Data Science:
1. Is it practical to collect fine-grained provenance?
   a. To what extent can it be done automatically?
   b. How much does it cost?
2. Is it also useful? → what is the benefit to data analysts?
Work in progress! Interest? Ideas?
57
Acknowledgments
Prof. Mike Catt
PhD Students: Ben Lam, Philip Darke
MSc student: Sam Butterfield
Prof. Guaraldi
Prof. Mandreoli
MSc student: Davide Ferrari
Prof. Torlone
MSc student: Giulia Simonelli
Prof. Chapman

Healthcare Conference 2013 : Toekomstvisie op ICT in de gezondheidszorg - pro...Healthcare Conference 2013 : Toekomstvisie op ICT in de gezondheidszorg - pro...
Healthcare Conference 2013 : Toekomstvisie op ICT in de gezondheidszorg - pro...D3 Consutling
 

Data Science for (Health) Science: tales from a challenging front line, and how to cross a few T's

  • 1. Data Science for (Health) Science: tales from a challenging front line, and how to cross a few T's Paolo Missier School of Computing Newcastle University, UK March 2021 A talk given to The School of Information Sciences Center for Informatics Research in Science and Scholarship University of Illinois Urbana-Champaign paolo.missier@ncl.ac.uk LinkedIn: paolomissier Twitter: @PMissier
  • 2. 2 The message: 1. “Data Science” for Health is hard. The hard part is the data 2. “AI for Health” is (Deep) Machine Learning 3. Ethics. Fairness. Trust. Acceptance. 4. Data Provenance for Data Science: Solution or distraction? • Transparency • Trustworthiness • Traceability
  • 4. 4 AI for healthcare – the UK landscape https://www.turing.ac.uk/research/research-programmes/health-and-medical-sciences AI and data science will improve the detection, diagnosis, and treatment of illness. They will optimise the provision of services, and support health service providers to anticipate demand and deliver improved patient care. • Explainability / Interpretability • Exploiting EHR (Electronic Health Records) • Image interpretation • Fairness, Bias • Ethical issues in … • Predicting <disease / critical event> …
  • 5. 5 Personalised, Predictive, Preventive, Participatory Medicine (P4) Price ND, Magis AT, Earls JC, et al. A wellness study of 108 individuals using personal, dense, dynamic data clouds. Nat Biotechnol. 2017;35:747.
  • 6. 6 (*) Data-Driven, Personalised, Predictive, Preventive, Participatory D2P4 (*) Healthcare research • Cleaning • Integration • Alignment • Imputation • NLP • … Physical Activity monitoring (wearables) In-patient hospital records Primary care health records + prescriptions Clinical protocols Multi-omics (genomics, transcriptomics, proteomics, metabolomics…) Images -- Histology, X-ray, … Early detection of Type 2 Diabetes / Metabolic / age-related diseases Early detection of Parkinson’s Frailty / intrinsic capacity assessment Multi Morbidity Long Term Conditions (MLTC) Covid risk / Post-Acute Covid Syndrome (PACS) Liver disease progression: NAFLD / NASH Liquid biopsy Programming: Scripting: python, R, … Workflows: Knime, RapidMiner.. Methods Clustering (ML) Predictive modelling (ML) Image interpretation / Deep Learning … “AI”… (plus traditional statistics!)
  • 7. 7 Big Data for Health Care Genomics for personalized medicine personal monitors / wearables Medical Records Article Source: Big Data: Astronomical or Genomical? Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical?. PLOS Biology 13(7): e1002195. https://doi.org/10.1371/journal.pbio.1002195
  • 8. 9 D2P4  Accelerometry Physical Activity monitoring (wearables) In-patient hospital records Primary care health records + prescriptions Clinical protocols Multi-omics (genomics, transcriptomics, proteomics, metabolomics…) Images -- Histology, X-ray, … Early detection of Type 2 Diabetes / Metabolic / age-related diseases Early detection of Parkinson’s Frailty / intrinsic capacity assessment Multi Morbidity Long Term Conditions (MLTC) Covid risk / Post-Acute Covid Syndrome (PACS) Liver disease progression: NAFLD / NASH Liquid biopsy • Cleaning • Integration • Alignment • Imputation • NLP • … Programming: Scripting: python, R, … Workflows: Knime, RapidMiner.. Methods Clustering (ML) Predictive modelling (ML)
  • 9. Digital biomarkers. Digital biomarkers come from “novel sensing systems capable of continuously tracking behavioral signals […] capture people's everyday routines, actions, and physiological changes that can explain outcomes related to health, cognitive abilities, and more” (Choudhury 2018). Examples: physical activity; glucose levels; blood oxygen levels; … Inexpensive → scalable personalised self-monitoring. Reference: Choudhury, Tanzeem. 2018. “Making Sleep Tracking More User Friendly.” Communications of the ACM 61 (11): 156–156. https://doi.org/10.1145/3266285
  • 10. A first project: markers from accelerometers? Initial study: Digital biomarkers (physical activity; glucose levels; blood oxygen levels; …) + UK Biobank dataset + Type 2 Diabetes outcome. Aligned with the P4 agenda. Readily available dataset: 100K activity traces. (+) 3,500+ features; (+) multi-omics coverage; (+) genomics; (+) links to EHR; (+) activity monitors made in Newcastle! (-) Limited follow-ups, so little longitudinal data; (-) population not random; (-) activity data per person very limited.
  • 11. Using wearable activity trackers to predict Type-2 Diabetes. Objective: To determine the extent to which accelerometer traces can be used to distinguish individuals with Type-2 Diabetes (T2D) from normoglycaemic controls, and to quantify their limitations. Steps: Feature extraction; Clustering; Classification ?? Reference: Lam B, Catt M, Cassidy S, Bacardit J, Darke P, Butterfield S, Alshabrawy O, Trenell M, Missier P. Using Wearable Activity Trackers to Predict Type 2 Diabetes: Machine Learning–Based Cross-sectional Study of the UK Biobank Accelerometer Cohort. JMIR Diabetes. 20/01/2021:23364 (forthcoming/in press).
  • 12. Granular activity representation: feature extraction, 60 features / day
  • 13. All UK Biobank participants: 502,664. Filter: Accelerometry study? 103,712. Split criteria: Type 2 Diabetes? At baseline: 2,755; through EHR analysis: 1,321; total: 4,076. Non-Diabetes: 99,636. Filter: EHR data available? 19,852. Filter: QC on activity traces: 3,103 (positives). Physical impairment analysis: severe impairment 1,666; no impairment 8,463. Comparisons: T2D vs Norm-0; T2D vs Norm-1. A great UG project! Your (biomedical) dataset may not be as big as it looks.
  • 14. (some) results, by feature set (HLAF, SDL, HLAF+SDL) and choice of negatives (Norm-0, Norm-2):
        Features:    HLAF            SDL             HLAF+SDL
        Negatives:   Norm-0  Norm-2  Norm-0  Norm-2  Norm-0  Norm-2
        RF           .80     .68     .83     .78     .86     .77
        LR           .79     .70     .83     .78     .86     .78
        XGB          .78     .66     .80     .74     .85     .75
  • 15. Ongoing work. Are there better embedded representations for accelerometry data? Can they be used as predictors for other outcomes? Representation learning; Embedded feature space; LSTM Autoencoder; Outcome: Insulin sensitivity; DIRECT DB; Standard classification
  • 16. D2P4 → COVID. Physical Activity monitoring (wearables); In-patient hospital records; Primary care health records + prescriptions; Clinical protocols; Multi-omics (genomics, transcriptomics, proteomics, metabolomics…); Images (Histology, X-ray, …). Early detection of Type 2 Diabetes / Metabolic / age-related diseases; Early detection of Parkinson’s; Frailty / intrinsic capacity assessment; Multi Morbidity Long Term Conditions (MLTC); Covid risk / Post-Acute Covid Syndrome (PACS); Liver disease progression: NAFLD / NASH; Liquid biopsy. Programming: Scripting (Python, R, …); Workflows (Knime, RapidMiner, …). Methods: Clustering (ML); Predictive modelling (ML)
  • 17. Predicting respiratory failure in patients with COVID-19 pneumonia: a case study from Northern Italy. D. Ferrari, Prof. F. Mandreoli, Prof. G. Guaraldi, Prof. P. Missier. Peak of Italian Covid crisis (March 2020 onwards). Issue: ICU capacity. Question: will my next patient require ICU resources? How soon? Reference: Ferrari D, Milic J, Tonelli R, Ghinelli F, Meschiari M, et al. (2020) Machine learning in predicting respiratory failure in patients with COVID-19 pneumonia—Challenges, strengths, and opportunities in a global health emergency. PLOS ONE 15(11): e0239172. https://doi.org/10.1371/journal.pone.0239172
  • 18. Study structure. Applied Machine Learning driven by a clinical question. An example of a typical data science pattern: • Data selection → inclusion, exclusion criteria • Data preparation / cleaning • Variable selection • Model learning → multiple models • Model evaluation. With additional challenges: a “live”, evolving dataset with multiple versions of a patient database • changes in recording practices • inconsistencies • lots of missing data. Small data: 198 patients → 1,068 observations → 31-90 variables (symptoms, lab biomarkers). During the data collection period, the dataset grew daily by an average of 84 new records per day, with a mean of 10 new data points per patient. Out of the initial sample of 295 patients and 2,889 data points available, 198 patients contributed 1,068 valuable observations. In detail, 603 observations met the definition of respiratory failure (PaO2/FiO2 < 150 mmHg) and 465 did not. Each data point included a complex record of observations from multiple categories: (1) signs and symptoms, (2) blood biomarkers, (3) respiratory assessment with PaO2/FiO2, (4) history of comorbidities (available in a subset of 119 patients). Some variables were collected daily, and others were recorded upon clinical indication.
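    The outcome definition above (respiratory failure when PaO2/FiO2 < 150 mmHg) can be sketched as a labelling step. This is only an illustration under stated assumptions: the function name, the record layout, and the example values are hypothetical and are not taken from the study; a missing ratio is left unlabelled rather than imputed.

```python
from typing import Optional

# Threshold quoted on the slide: respiratory failure is PaO2/FiO2 < 150 mmHg.
PF_THRESHOLD_MMHG = 150.0


def label_respiratory_failure(pao2_fio2: Optional[float]) -> Optional[bool]:
    """Return True/False for a recorded ratio, or None when the ratio is missing."""
    if pao2_fio2 is None:
        return None  # no imputation: the observation simply stays unlabelled
    return pao2_fio2 < PF_THRESHOLD_MMHG


# Hypothetical observations (patient id, PaO2/FiO2 in mmHg); NOT study data.
observations = [("p1", 120.0), ("p1", 180.0), ("p2", None), ("p3", 149.9)]
labels = [label_respiratory_failure(ratio) for _, ratio in observations]
usable = [label for label in labels if label is not None]
```

    Keeping the missing case explicit mirrors the slide's point that a live clinical dataset has many gaps, and that only observations with a recorded ratio can contribute a label.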
  • 19. A case study to illustrate the problem
  • 20. Modelling Requirements • Parsimonious → few variables • Robust to missing data → imputation not an option • Explainable → trust: the model reveals the relative importance of each variable for each prediction it makes • Minimize the number of false negatives: risk of under-estimating the severity of a patient’s condition
  • 21. Approach: ensemble of decision trees • Parsimonious → feature ranking and selection • Robust to missing data • Explainable → Shapley values • Minimize FN → bespoke loss function
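    The slide mentions a bespoke loss function to minimise false negatives but does not give its form. As one possible reading, here is a minimal sketch of a log loss in which errors on positive cases are weighted more heavily; the weight `fn_weight` and its default value are assumptions made for illustration, not the loss used in the study.

```python
import math


def fn_weighted_log_loss(y_true, p_pred, fn_weight=5.0, eps=1e-12):
    """Mean binary log loss where errors on positives cost fn_weight times more.

    y_true: iterable of 0/1 labels; p_pred: predicted probabilities of class 1.
    fn_weight is a hypothetical tuning knob, not a value from the study.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        if y == 1:
            total += -fn_weight * math.log(p)  # missed positive: heavily penalised
        else:
            total += -math.log(1.0 - p)  # false alarm: standard penalty
    return total / len(y_true)


# An equally confident mistake costs more when it misses a true positive.
missed_positive = fn_weighted_log_loss([1], [0.1])
false_alarm = fn_weighted_log_loss([0], [0.9])
```

    With `fn_weight=1.0` the function reduces to the ordinary log loss, so the weight makes the asymmetry between the two kinds of error explicit.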
  • 22. Testing multiple models: results. Parsimony: Model 1: suboptimal prediction accuracy. Model 2: adding biomarkers, including respiratory variables, increased performance. Model 3: boosted mixed model, which still requires about 20 variables. From a physician’s perspective, a cluster of 20 variables may be difficult to manage in routine clinical practice. What our approach offers in support of the decision-making process is a simple interpretation of the predictions.
  • 23. Which are the most important predictors? SHAP values
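    To make the idea of per-prediction variable importance concrete, the sketch below computes exact Shapley values for one instance by averaging each feature's marginal contribution over all feature orderings. It is a toy under stated assumptions: the model `toy_risk`, its three inputs, and the all-zero baseline are hypothetical, and the code does not use or reproduce the SHAP library, which relies on faster approximations for real tree ensembles.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance (feasible only for a few features)."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)  # start from the reference input
        prev = predict(current)
        for i in order:
            current[i] = x[i]  # switch feature i to its actual value
            new = predict(current)
            phi[i] += new - prev  # marginal contribution in this ordering
            prev = new
    return [total / len(orderings) for total in phi]


def toy_risk(features):
    """Hypothetical risk score with one interaction term (a * c)."""
    a, b, c = features
    return 2.0 * a + 1.0 * b + 0.5 * a * c


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_risk, x, baseline)
```

    The values always sum to the difference between the prediction for `x` and the prediction for the baseline, which is what lets a clinician read them as a breakdown of a single prediction.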
  • 24. Summary. Good results on “live” data, predicting a useful outcome for the purpose of ICU management. Major selling points: • Variables (relatively) easy to collect in routine visits and in-hospital • Models are explainable; medics can reality-check them against their own understanding … Opened the door to further collaborations: new project on PACS (Post-Acute Covid Syndrome), following up recovery paths for 300 patients across 5 hospitals
  • 25. D2P4 → EHR analysis for dynamic risk prediction. D2P4 (*) Healthcare research. Physical Activity monitoring (wearables); In-patient hospital records; Primary care health records + prescriptions; Clinical protocols; Multi-omics (genomics, transcriptomics, proteomics, metabolomics…); Images (Histology, X-ray, …). Early detection of Type 2 Diabetes / Metabolic / age-related diseases; Early detection of Parkinson’s; Frailty / intrinsic capacity assessment; Multi Morbidity Long Term Conditions (MLTC); Covid risk / Post-Acute Covid Syndrome (PACS); Liver disease progression: NAFLD / NASH; Liquid biopsy. Programming: Scripting (Python, R, …); Workflows (Knime, RapidMiner, …). Methods: Clustering (ML); Predictive modelling (ML); Survival analysis; Longitudinal prediction models
  • 26. Longitudinal data: health-related events. UK Biobank - Primary Care Linked Data. https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41270
  • 27. Clinical Risk Prediction Models. Healthy participant, or missing data / under-reported conditions? Is the number/pattern of records a proxy for health? Informed presence bias: individuals in EHR data are systematically different from those who are not (Goldstein et al., 2016)
  • 28. Case study: Type 2 Diabetes. [Figure 17: Example output of the phenotyping tool for one participant (redacted). Panels: glycated hemoglobin HbA1c (mmol/mol) with pre-diabetes, diabetes and remission periods; fasting plasma glucose (mmol/l) with normoglycaemic, pre-diabetic and diabetic ranges, from primary care records and UKBB visits; electronic health records timeline (primary care and secondary care; Event, Obs, Drug, Diag, Op), 1987 to 2015, showing the estimated observation period, records and diabetes records.]
  • 29. Case study: Type 2 Diabetes – remission study. Type 2 diabetes remission: longitudinal phenotyping with large-scale observational data. Philip Darke, EPSRC Centre for Doctoral Training in Cloud Computing for Big Data, Newcastle University. UK Biobank (ukbiobank.ac.uk) is a UK-based prospective study into illness in middle and old age with over 500,000 participants. Diabetes is one of the most prevalent conditions in the cohort, with nearly 70,000 diagnoses expected by 2027 (note 2: Naomi Allen, et al. UK Biobank: Current status and what it means for epidemiology. Health Policy and Technology, 1(3):123–126, September 2012. doi:10.1016/j.hlpt.2012.07.003). Study data is collected at participant visits and via linkage to national datasets including EHR data. These data have been used to longitudinally phenotype over 200,000 participants for diabetes, as illustrated in Figure 1. The approach will be expanded to all participants when further data is released. [Figure 1: Model output showing HbA1c (mmol/mol), weight (kg), periods of medication (biguanides) and inferred diabetic status (pre-diabetes, type 2 diabetes, remission) for an example participant, 2004 to 2016. Long-term remission was achieved by sustained weight loss post diagnosis.] Many of those diagnosed with type 2 diabetes experience a subsequent period of remission. Some relapse whilst others achieve long-term remission and cease anti-diabetes medication. This project will examine the pathways to remission at scale using ob-
  • 30. D2P4 → MLTC-M. Physical Activity monitoring (wearables); In-patient hospital records; Primary care health records + prescriptions; Clinical protocols; Multi-omics (genomics, transcriptomics, proteomics, metabolomics…); Images (Histology, X-ray, …). Early detection of Type 2 Diabetes / Metabolic / age-related diseases; Early detection of Parkinson’s; Frailty / intrinsic capacity assessment; Multiple Long Term Conditions (MLTC); Covid risk / Post-Acute Covid Syndrome (PACS); Liver disease progression: NAFLD / NASH; Liquid biopsy. Programming: Scripting (Python, R, …); Workflows (Knime, RapidMiner, …). Methods: Clustering (ML); Predictive modelling (ML); NLP
  • 31. Multimorbidity and Long-Term Conditions. Patients with multimorbidities have the greatest healthcare needs and generate the highest expenditure in the health system. There is an increasing focus on identifying specific disease combinations for addressing poor outcomes. Methods: matrix factorization / factor analysis; clustering; multiple correspondence analysis; network analysis; … Which data? Fragmented / disconnected data sources → data access → data governance
  • 32. D2P4 → NAFLD / non-alcoholic fatty liver disease. Physical Activity monitoring (wearables); In-patient hospital records; Primary care health records + prescriptions; Clinical protocols; Multi-omics (genomics, transcriptomics, proteomics, metabolomics…); Images (Histology). Early detection of Type 2 Diabetes / Metabolic / age-related diseases; Early detection of Parkinson’s; Frailty / intrinsic capacity assessment; Multi Morbidity Long Term Conditions (MLTC); Covid risk / Post-Acute Covid Syndrome (PACS); Liver disease progression: NAFLD / NASH; Liquid biopsy. Cleaning; Integration; Alignment; Imputation; NLP; … Programming: Scripting (Python, R, …); Workflows (Knime, RapidMiner, …). Methods: Clustering (ML); Predictive modelling (ML); Image interpretation / Deep Learning; … “AI” …
  • 33. D2P4 → NAFLD / NASH (NASH = non-alcoholic steatohepatitis). Aims: integrate cross-sectional and longitudinal clinical outcomes data with a multi-dimensional ‘omics’ record. Hypothesis: a precision medicine approach leads to a better understanding of individuals’ trajectories. Personalised biomarkers → liquid biopsy. Dataset: European NAFLD Registry, 7,750 patients with histologically proven NAFLD/NASH; omics (cross-sectional); longitudinal follow-ups. Methods: precision: clustering; anticipating progression: learn cluster-specific longitudinal models
  • 34. DP4DS: Data Provenance for Data Science. D2P4 + DP4DS (*). Physical Activity monitoring (wearables); In-patient hospital records; Primary care health records + prescriptions; Clinical protocols; Multi-omics (genomics, transcriptomics, proteomics, metabolomics…); Images (Histology, X-ray, …). Early detection of Type 2 Diabetes / Metabolic / age-related diseases; Early detection of Parkinson’s; Frailty / intrinsic capacity assessment; Multi Morbidity Long Term Conditions (MLTC); Covid risk / Post-Acute Covid Syndrome (PACS); Liver disease progression: NAFLD / NASH; Liquid biopsy. Programming: Scripting (Python, R, …); Workflows (Knime, RapidMiner, …). Methods: Clustering (ML); Predictive modelling (ML); Image interpretation / Deep Learning; … “AI” … (plus traditional statistics!)
  • 35. Data → Model → Predictions. Diagram labels: Data collection; Raw datasets; pre-processing; Instances; features; Model; Predicted you (Ranking, Score, Class). Key decisions are made during data selection and processing: Where does the data come from? What’s in the dataset? What transformations were applied? Complementing current ML approaches to model interpretability: 1. Can we explain these decisions? 2. Are these explanations useful?
  • 36. Explaining data preparation. Diagram labels: Population data; Data collection; Raw datasets; pre-processing; Instances; features; Model; Predicted you (Ranking, Score, Class). Data acquisition and wrangling: How were datasets acquired? How recently? For what purpose? Are they being reused / repurposed? What is their quality? Pre-processing: Integration; Cleaning; Outlier removal; Normalisation; Feature selection; Class rebalancing; Sampling; Stratification; … Scripts → Python / TensorFlow, Pandas, Spark. Workflows → Knime, … Provenance → Transparency
  • 37. Recent early results. A small grassroots project… [1]: formalisation of provenance patterns for pipeline operators; systematic collection of fine-grained provenance from (nearly) arbitrary pipelines. Reality check: How much does it cost? → provenance volume. Does it help? → queries against the provenance database. [1] Chapman, A., Missier, P., Simonelli, G., & Torlone, R. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4):507-520, January 2021.
  • 38. Operators. Data reduction: Feature selection; Instance selection. Data augmentation: Space transformation; Instance generation; Encoding (e.g. one-hot, …). Data transformation: Data repair; Binarisation; Normalisation; Discretisation; Imputation. Example: vertical augmentation → adding columns
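    As a rough illustration of vertical augmentation (adding columns) with cell-level provenance, the sketch below one-hot encodes a column and records which source cell each new cell was derived from. The function name and the record layout are hypothetical and were chosen for this example; they are not the provenance format defined in the cited paper.

```python
def one_hot_with_provenance(rows, column):
    """One-hot encode `column`, returning new rows plus cell-level provenance.

    rows: list of dicts (one per instance). The provenance records use a
    hypothetical layout: which new cell was derived from which old cell.
    """
    categories = sorted({row[column] for row in rows})
    new_rows, provenance = [], []
    for idx, row in enumerate(rows):
        new_row = dict(row)  # keep the original columns untouched
        for cat in categories:
            new_col = f"{column}_{cat}"
            new_row[new_col] = 1 if row[column] == cat else 0
            provenance.append(
                {"derived": (idx, new_col), "used": (idx, column), "operator": "one_hot"}
            )
        new_rows.append(new_row)
    return new_rows, provenance


# Hypothetical two-row table; NOT study data.
table = [{"sex": "F"}, {"sex": "M"}]
encoded, prov = one_hot_with_provenance(table, "sex")
```

    Even on this tiny table the operator emits one provenance record per new cell, which hints at why provenance volume is one of the costs the talk sets out to measure.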
  • 39. Code instrumentation: initialize provenance capture; create a provlet for a specific transformation. …code injection is now being automated!
  • 41. Provenance templates: Template + binding rules = instantiated provenance fragment. op: {old values: F, I, V} → {new values: F’, J, V’}
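    The "template + binding rules = instantiated fragment" idea can be sketched as simple placeholder substitution. The relation names below (`used`, `wasGeneratedBy`, `wasDerivedFrom`) come from the W3C PROV vocabulary, but the template shape, the `cell(row,feature)` identifiers, and the helper names are assumptions made for illustration, not the templates used in the project.

```python
# A template is a list of (relation, subject, object) patterns with placeholders.
TEMPLATE = [
    ("wasGeneratedBy", "{new}", "{op}"),
    ("used", "{op}", "{old}"),
    ("wasDerivedFrom", "{new}", "{old}"),
]


def bindings_for(op_name, feature_old, feature_new, row_ids):
    """Hypothetical binding rule: one binding per affected row."""
    return [
        {"op": op_name, "old": f"cell({i},{feature_old})", "new": f"cell({i},{feature_new})"}
        for i in row_ids
    ]


def instantiate(template, binding):
    """Fill every placeholder in the template with the values of one binding."""
    return [(rel, a.format(**binding), b.format(**binding)) for rel, a, b in template]


# Instantiate the template for a normalisation applied to rows 0 and 1.
bindings = bindings_for("normalise", "age", "age_norm", [0, 1])
fragment = [triple for b in bindings for triple in instantiate(TEMPLATE, b)]
```

    Separating the fixed template from the per-row bindings means an operator only has to supply the bindings at run time, which is one way automated code injection could stay lightweight.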
  • 42. This applies to all operators…
  • 43. Putting it all together
  • 47. Summary. Multiple hypotheses regarding Data Provenance for Data Science: 1. Is it practical to collect fine-grained provenance? (a) To what extent can it be done automatically? (b) How much does it cost? 2. Is it also useful? → What is the benefit to data analysts? Work in progress! Interest? Ideas?
  • 48. Acknowledgments. Prof. Mike Catt; PhD students: Ben Lam, Philip Darke; MSc student: Sam Butterfield; Prof. Guaraldi; Prof. Mandreoli; MSc student: Davide Ferrari; Prof. Torlone; MSc student: Giulia Simonelli; Prof. Chapman

Editor's notes

  1. CVD was the leading cause of death for males (15.5%) and the second for females (8.8%) in 2015 (*)
  2. How about the data used to train / build the model?
  3. Relatively easy to keep track of data pre-processing → provenance
  4. Features: $X=[\mathbf{a}_1 \ldots \mathbf{a}_k]$. New features: $Y=[\mathbf{a}'_1 \ldots \mathbf{a}'_l]$. New values for each row are obtained by applying $f$ to the values in the $X$ features.