Presenter: Kenneth Young, Ph.D. Assistant Professor, Health Informatics Institute, University of South Florida
Abstract
Type 1 diabetes (T1D) is a complex and heterogeneous autoimmune disease that is no longer considered a clear-cut clinically diagnosed disease. T1D is multifaceted and the efficacy of therapeutic interventions varies greatly. With the evidence of etiological differences in T1D and the availability of high-dimensional multi-omics data in combination with clinical and environmental data, this project aims to use an artificial intelligence (AI) exploratory approach that may aid in the identification of new markers to predict islet autoimmunity (IA) and T1D.
This project utilizes data from The Environmental Determinants of Diabetes in the Young (TEDDY) study, which is funded by NIDDK. TEDDY has generated over 900 TB of diverse data types including multi-omics data, deep phenotyping, and environmental factor measurements every three to six months for fifteen years. We utilize deep learning methods, such as convolutional neural networks (CNN) and recurrent neural networks (RNN) that apply bidirectional long short-term memory (LSTM), in combination with multi-layer perceptron (MLP), to evaluate the prediction of IA and T1D. To aid in T1D prediction, this project uses innovative and transformative AI approaches that combine temporal and static data, which may ultimately provide insights into the complex heterogeneity, diversity, and pathogenesis of T1D. The knowledge gained from this project may not only help advance the T1D community, but may have a broad impact on a variety of autoimmune diseases such as celiac and thyroid diseases, which frequently coexist with T1D and share genetic susceptibility with it.
3. Artificial Intelligence
• Artificial Intelligence (AI) refers to machines or
computers that emulate the cognitive functions
associated with the human mind.
• Learning
• Problem solving
• A popular branch of computer science that aims to build
intelligent machines capable of performing intelligent
tasks.
• Activities that would necessitate human intelligence.
May 13, 2022 Integrative AI approach to predict T1D 3
4. AI Progression
• Deep learning algorithms started to become mainstream in the early
2010s.
• Deep learning algorithms represent AI techniques that have neural
networks capable of unsupervised learning from data that are
unstructured.
5. AI Learning Approaches
• Three related concepts:
• Artificial Intelligence (AI)
• Machine learning (ML)
• Deep learning (DL)
• Three primary types of learning:
• Supervised
• Unsupervised
• Reinforcement
6. Deep Learning Model Types
[Diagram: Deep Learning Model Types: CNNs, DNNs, LSTMs, RNNs, GANs, RBFNs, MLPs, SOMs, DBNs, RBMs, AEs]
• Deep Learning Model Types:
• Convolutional Neural Networks (CNNs)
• Long Short-Term Memory Networks (LSTMs)
• Recurrent Neural Networks (RNNs)
• Generative Adversarial Networks (GANs)
• Radial Basis Function Networks (RBFNs)
• Multilayer Perceptrons (MLPs)
• Self-Organizing Maps (SOMs)
• Deep Belief Networks (DBNs)
• Restricted Boltzmann Machines (RBMs)
• Autoencoders (AEs)
7. Deep Learning
• The artificial neural networks are built like
the human brain, with neuron nodes
connected like a web.
• Deep learning maps inputs to outputs and
finds correlations.
• Deep learning is composed of several layers.
• The layers consist of nodes (“neurons”).
• Hidden layers are the layers of nodes between the input and output
nodes. They allow for non-linearity, which resolves the XOR
(“exclusive or”) problem in artificial neural network (ANN) research.
• The output of some deep learning models,
such as LSTMs, feeds back as input to the next
step, and the model retains previous inputs
through internal memory.
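The point about hidden layers and the XOR problem can be illustrated with a very small network. The sketch below is illustrative only: the weights are chosen by hand rather than learned, and the names `W_hidden`, `b_hidden`, `w_out`, and `xor_net` are assumptions made for this example, not part of the project code.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: the non-linearity applied in the hidden layer."""
    return np.maximum(z, 0.0)

# Hand-chosen (not learned) weights for a 2-input, 2-hidden-node, 1-output network.
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])   # each hidden node sums both inputs
b_hidden = np.array([0.0, -1.0])    # the second node is active only when both inputs are 1
w_out = np.array([1.0, -2.0])       # subtract the "both inputs on" signal twice

def xor_net(x):
    hidden = relu(x @ W_hidden + b_hidden)
    return hidden @ w_out

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
outputs = xor_net(X)  # one value per row of X
```

Without the hidden layer and its non-linearity, no single weighted sum of the two inputs can reproduce this pattern, which is why XOR is a standard motivating example.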
8. Using AI (Deep Learning)
• Feature learning: Hierarchy of
increasing complexity and
abstraction
• Makes deep learning networks
capable of handling very large, high-
dimensional data sets with billions of
parameters that pass through
nonlinear functions
• Capable of discovering latent
structures within unlabeled,
unstructured data: [Text, pictures,
video, audio]
Example of feature hierarchy learned by a deep learning model on faces from Lee et al. (2009).
9. AI Prediction
• Google’s DeepMind predicts likelihood of a patient developing acute
kidney injury (AKI), a life-threatening condition
• University of Nottingham developed AI to predict the risk of early death
due to chronic diseases in a large middle-aged population
• AI used to predict progression of diabetic kidney disease
• IBM’s Watson can diagnose heart disease
• Deep learning used for the prediction of variant effects on expression and
disease risk
• Deep learning for inferring gene relationships from single-cell expression
data
• Prediction of Alzheimer’s Disease Based on Bidirectional LSTM
10. AI Prediction
• Researchers at MIT and Wyss Institute at
Harvard University developed an AI tool to
aid in the detection of skin cancer
• Successfully distinguished suspicious pigmented lesions (SPLs) from
non-suspicious lesions in photos of patients’ skin
with ~90% accuracy
• A pre-trained deep convolutional neural
network (DCNN) determines the
suspiciousness of individual pigmented
lesions
• Yellow = consider further inspection
• Red = requires further inspection or referral to
dermatologist
Images source: Harvard University researchers
11. dkNET Project: Integrative Artificial
Intelligence Approach to Predict T1D
• Goals:
• Use artificial intelligence (AI) and machine learning (ML) tools to develop
novel computational approaches for synthesis and analysis of multi-omics,
clinical, and environmental data to evaluate the prediction of type 1 diabetes
(T1D).
• Develop an AI framework and pipelines for fusion and analysis of multi-level
and multi-scale data.
• AI capable of learning temporal and static data.
12. Underlying Assumptions
• Results may not generalize to the general T1D population.
• Using data from the TEDDY nested case-control study (NCCS), we will
examine participants diagnosed with T1D and participants with persistent and
confirmed multiple-autoantibody islet autoimmunity (IA) within the nested case-control cohort.
• Quality control (QC) metrics, background correction, and data normalization
are performed on applicable data prior to analysis.
13. TEDDY Study
• This project utilizes data from an NIDDK-funded study called The
Environmental Determinants of Diabetes in the Young (TEDDY) study.
• The TEDDY study investigates:
• Genetic and genetic-environmental interactions, including gestational
infection or other gestational events.
• Childhood infections or other environmental factors after birth in relation to
the development of prediabetes autoimmunity and T1D.
14. TEDDY Study Continued
• The long-term goal of the TEDDY study is the identification of factors
which trigger T1D in genetically susceptible individuals or which
protect against the disease.
• Identification of such factors will lead to a better understanding of
disease pathogenesis and result in new strategies to prevent, delay or
reverse T1D.
• The TEDDY participants are followed for 15 years for the appearance
of various beta-cell autoantibodies and diabetes.
• The participants are no longer followed once they reach a study
endpoint.
15. TEDDY Study Centers
• United States: Florida, Georgia, Washington, Colorado
• Europe: Germany, Finland, Sweden
16. TEDDY Study Research Findings
• This work builds upon published TEDDY results.
• TEDDY researchers have found that autoimmune destruction of
insulin producing cells typically begins in the first two years of life.
17. Type 1 Diabetes
• Type 1 diabetes (T1D) is a complex autoimmune disease resulting in
the destruction of β-cells, leading to deficient insulin production
over time.
• Diabetes is a worldwide epidemic: prevalence figures estimate that 382
million people were living with diabetes in 2013. By 2035 this number is
projected to rise to 592 million.
• The prevalence of T1D in individuals younger than 20 years of age has
increased by 23%, from 2001 to 2009.
• The CDC’s 2020 National Diabetes Statistics report shows the
prevalence of T1D in the U.S. increased nearly 30% from 2017 to
2020.
18. T1D Triggers and Risk Factors
• TEDDY researchers believe that T1D may be triggered by environmental factors
and genetic traits.
• The main genes associated with T1D are human leukocyte antigen (HLA) DR3,
DR3-DQ2, DR4, and DR4-DQ8.
• T1D risk factors may be the same for the general population and for those genetically at risk.
• Genetic and immunopathogenic studies have directly implicated cytokines in the
pathogenesis of T1D.
• At least five autoantibody-reactivities are predictive of T1D:
• ICA, IAA, GADA, IA-2A, and ZnT8A.
• Autoantibodies against insulin (IAA) are usually the first to occur in children at risk for T1D.
• The association between antibody prevalence and T1D confirms the importance
of antibody detection in at risk individuals, prior to clinical onset.
19. AI Methods
• A data-driven AI approach to explore the TEDDY study data to predict
T1D. This may provide insight into the important features that may
cause or protect against T1D.
• Deep Learning Neural Networks:
• Recurrent Neural Networks (RNN, LSTM)
• Multilayer Perceptrons (MLP)
• Combination of Neural Networks
• Static and Sequential Feature Modeling
• TEDDY Data:
• Static: Family history, maternal history, SNPs
• Dynamic: Proteomics, gene expression, test results, clinical visits
20. AI Tools
• The primary development is programmed in Python version 3.7.
• For AI modeling, TensorFlow version 2.5.0 and Keras version 2.3.1
are used.
• Statistical and machine learning programming is performed
using the Visual Studio Community 2019 integrated development
environment version 16.10.2.
• The Snakemake workflow management tool is used to create
reproducible and scalable data analysis to run on high
performance computing (HPC) environments.
• GitHub is used for version control and source code management.
• Note: These versions may change during the course of this project
21. AI Data
• Data are drawn from the TEDDY study.
• Limited to the first two years of data from the TEDDY nested case-
control study (NCCS) for first iteration of the AI model.
• From case-control cohort of genetically at risk (HLA-susceptibility
genotypes) study participants from the TEDDY study.
• Comprises temporal (time-series) and static features.
• Diverse types including multi-omics data and environmental factor
measurements every three to six months for fifteen years.
22. AI Workflow
• Acquire Data: static, temporal
• Preprocess Data: mask, drop, aggregate, normalize
• Data Imputation:
• Static: mean, median, most frequent
• Temporal: interpolate, LOCF, NOCB, case complete
• Split Data: train, validation, test
• Feature Selection: SVC, SVM, random forest, none
• Scale Data: static, temporal
• AI Model: compile, fit, predict, evaluate
• AI Model Outputs: accuracy, loss, precision, recall, AUC/ROC, predictions
• AI Model Interpretation: feature importance (static, temporal)
• AI Model Interpretation Visualization: SHAP plots
23. Preprocessing: Data Acquisition
• Load various data files acquired from the TEDDY Study.
• Data are limited to TEDDY NCCS cohorts.
• Consists of time-series (temporal) and static data.
• Various –omics, environmental, and exposure data
24. Preprocessing: Data Masking
• The masking process de-identifies or obfuscates the data.
• Enhances the privacy and security of the data.
• Prevents potential research bias.
25. Preprocessing: Data Aggregation
• Aggregate data from the TEDDY study, which can include
-omics, environmental, diet, and exposure data.
• These data come in various structures that require domain
knowledge and time to prepare and aggregate.
• Data may be: grouped, joined, pivoted, transposed, and merged.
• Data are from various sources and must be joined.
• The data must be transformed to 3D time-series structure to feed the
LSTM network.
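The 2D-to-3D transformation mentioned above can be sketched with pandas and NumPy. This is a minimal illustration, not the project's package: the function `to_3d`, the column names (`participant`, `month`, `protein_x`, `gene_y`), and the toy values are all hypothetical, and it assumes one row per participant and visit.

```python
import numpy as np
import pandas as pd

def to_3d(long_df, id_col, time_col, feature_cols):
    """Reshape long-format rows into a (samples, timesteps, features) array.

    Visits that are absent for a participant become NaN so that they can be
    imputed or masked later. Assumes at most one row per (participant, visit).
    """
    ids = sorted(long_df[id_col].unique())
    times = sorted(long_df[time_col].unique())
    # Build the full participant-by-visit grid and align the data onto it.
    grid = pd.MultiIndex.from_product([ids, times], names=[id_col, time_col])
    wide = long_df.set_index([id_col, time_col])[feature_cols].reindex(grid)
    return wide.to_numpy(dtype=float).reshape(len(ids), len(times), len(feature_cols))

# Toy example: participant B has no 6-month visit.
df = pd.DataFrame({
    "participant": ["A", "A", "B"],
    "month": [3, 6, 3],
    "protein_x": [1.0, 2.0, 3.0],
    "gene_y": [0.1, 0.2, 0.3],
})
cube = to_3d(df, "participant", "month", ["protein_x", "gene_y"])
```

The resulting array has the samples-by-timesteps-by-features layout that recurrent layers such as LSTMs expect.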
26. Preprocessing: Data Dropping
• Certain data columns (“features”) are dropped as they obfuscate the
data.
• Data that do not meet the limit of detection (LOD) are either
dropped or used as binary.
• Censor data: drop data with time points > 24 months.
• Data dropped add no value to the model and only hinder the model's
predictive power.
• Example data dropped: participant identifiers, vial barcode numbers,
collection dates of specimens, samples below LOD.
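The dropping rules above can be sketched in pandas as follows. The column names (`participant_id`, `vial_barcode`, `collection_date`, `month`, `value`), the function `drop_and_censor`, and the LOD threshold are hypothetical stand-ins; the real TEDDY files will differ.

```python
import pandas as pd

# Hypothetical identifier columns that carry no predictive signal.
ID_COLUMNS = ["participant_id", "vial_barcode", "collection_date"]

def drop_and_censor(df, lod, max_month=24, below_lod="drop"):
    """Censor late time points, handle values below the LOD, drop identifiers."""
    out = df[df["month"] <= max_month].copy()
    if below_lod == "drop":
        out = out[out["value"] >= lod]
    else:
        # Binary use: 1 if the measurement reached the LOD, 0 otherwise.
        out["value"] = (out["value"] >= lod).astype(int)
    present = [c for c in ID_COLUMNS if c in out.columns]
    return out.drop(columns=present).reset_index(drop=True)

raw = pd.DataFrame({
    "participant_id": ["A", "A", "A"],
    "vial_barcode": ["v1", "v2", "v3"],
    "month": [3, 12, 30],
    "value": [0.5, 0.01, 0.7],
})
dropped = drop_and_censor(raw, lod=0.05)
binary = drop_and_censor(raw, lod=0.05, below_lod="binary")
```

Whether to drop or binarize sub-LOD measurements is a per-assay choice; the sketch exposes it as a parameter rather than fixing it.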
27. Data Imputation
• When missing values exist in a dataset, it is important to reason about why the data are missing
and how their missingness may affect the analysis or lead to false conclusions.
• If we ignore these missing data, statistical power may be reduced. Even more important is the
potential for bias, which may point to incorrect conclusions.
• With the AI Framework, various imputation methods can be employed:
• Temporal Data:
1. Interpolate – used by AI model
2. Last observation carry forward (LOCF)
3. Next observation carry backward (NOCB)
4. Case Complete (CC)
• Static Data:
1. Mean
2. Median – used by AI model
3. Most Frequent
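The imputation options listed above map onto standard pandas operations. The sketch below uses made-up series to show each one; it is not the framework's own implementation, and the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements for one participant across five visits.
visits = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

interpolated = visits.interpolate()   # linear interpolation between neighbors
locf = visits.ffill()                 # last observation carried forward
nocb = visits.bfill()                 # next observation carried backward
complete_case = visits.dropna()       # keep only observed values

# Hypothetical static feature across four participants.
static = pd.Series([2.0, np.nan, 4.0, 10.0])
static_median = static.fillna(static.median())
static_mean = static.fillna(static.mean())
static_mode = static.fillna(static.mode().iloc[0])
```

The median is less sensitive than the mean to extreme values such as the 10.0 above, which is one common reason to prefer it for skewed static features.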
28. Feature Selection
• Feature selection is the process of reducing the number of input variables when
developing a predictive AI model.
• It may be beneficial to reduce the number of input variables:
• Reduce computational cost
• Improve performance of AI model
• Various methods exist to select the important features of the data:
• Random Forest
• Support Vector Machine
• Support Vector Classification
• Select K-Best
• Chi-square
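As one illustration of the methods listed, the sketch below applies scikit-learn's `SelectKBest` with the ANOVA F-test (`f_classif`) to synthetic data. The data, the choice of `k=2`, and the scoring function are assumptions for the example; the slides do not state which configuration the framework uses.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: six features, of which only the first two drive the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Keep the k features with the highest univariate F-scores.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
selected = np.flatnonzero(selector.get_support())  # indices of kept columns
X_reduced = selector.transform(X)
```

To avoid leaking information, a selector like this would normally be fit on the training split only and then applied to the validation and test splits.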
29. Data Splitting
• The data are split into training, validation,
and test datasets.
• The AI framework uses scikit-learn’s
train_test_split function.
• Options: Stratified Shuffle Split, K-Folds cross
validation iterator.
• Balance data: Dataset splits are stratified by
y-value.
• The train_test_split splits the original dataset in
such a way that the proportion of both classes
(binary classification) is preserved in the training,
validation, and test datasets.
• Splits ((training, validation), test): (80:20):10
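A stratified three-way split can be sketched with two calls to `train_test_split`. The sketch reads the (80:20):10 notation as "hold out 10% for testing, then split the remainder 80:20 into training and validation"; that reading, the synthetic data, and the 30% positive rate are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic binary-outcome data with 30% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 300 + [0] * 700)

# Step 1: hold out 10% as the test set, stratified by the outcome.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0
)
# Step 2: split the remainder 80:20 into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.20, stratify=y_trainval, random_state=0
)
```

Because both calls pass `stratify`, the share of positive outcomes stays close to 30% in all three subsets.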
30. Data Scaling
• AI algorithms perform better when numerical input variables are scaled to
a standard range.
• Common to scale data prior to fitting neural network model.
• Data often consists of many different input variables or features (columns)
and each may have a different range of values or units of measure.
• Robust standardization (robust data scaling) is used as it can accommodate
data with outliers. This scaler removes the median and scales the data
according to the quantile range (defaults to IQR: interquartile range).
• Data are scaled and transformed. A scikit-learn scaler’s fit_transform is
used to standardize features. Training data are fit and transformed, while
validation and test data are only transformed.
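The fit-on-training, transform-elsewhere pattern can be sketched with scikit-learn's `RobustScaler`, which matches the median and IQR description above. The toy values are made up, and the slides do not name the exact scaler class, so treating it as `RobustScaler` is an assumption.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with an outlier (100.0) in the training data.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_test = np.array([[5.0]])

scaler = RobustScaler()                         # centers on the median, scales by the IQR
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training median and IQR
```

Here the training median is 3 and the IQR is 2, so the outlier does not compress the other values the way a mean and standard deviation would.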
31. AI Deep Learning Model
• We developed a deep learning model
capable of learning from time-series
features and static features.
• Uses Keras and TensorFlow to construct
multi-layer perceptron (MLP) neural networks
and recurrent neural networks
(RNN)
• Concatenate neural networks
[Figures: Simple MLP; Simple RNN]
32. AI Model
• A neural network framework was developed that can combine
temporal and static data.
• The TEDDY data are temporal (time-series) and comprise
various –omics data in addition to static data (environment,
diet, exposure, SNP).
• The temporal samples are collected at three-month intervals
(time-series).
• The data outcome is binary (0 or 1). 1 represents persistent
confirmed islet autoimmunity (IA), which in the TEDDY study,
is defined using MIAA, GADA, and IA2A autoantibodies.
• Future work will predict T1D as the binary outcome.
33. AI Model Hyperparameter tuning
• Building AI models is an iterative process to optimize the
model’s performance and compute resources.
• The settings adjusted during each iteration are called
hyperparameters, which govern the training process.
• A hyperparameter is a training parameter set by an AI
engineer before training the model. These parameters are not
learned by the machine learning model during the training
process.
• These decisions impact model metrics, such as accuracy.
34. AI Model Hyperparameters
• Example hyperparameters to tune:
• Batch size
• Number of epochs
• Number of hidden layers
• Dropout
• Learning rate
• Optimizer
• Regularization
• Gradient clipping
• Activation functions
• Loss/cost functions
• Tools exist to aid in hyperparameter tuning:
• Ray Tune
• KerasTuner
35. Deep Learning Models
• Concatenated multi-layer
perceptron neural
networks (MLP) and
recurrent neural networks
(RNN)
36. AI Model: Concatenated
• Concatenated multi-layer
perceptron (MLP) neural networks
and bidirectional long
short-term memory (LSTM)
recurrent neural networks
(RNN), with hidden layers
• Single output layer for binary
prediction (1, 0)
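A concatenated architecture of this kind can be sketched with the Keras functional API. This is a minimal illustration, not the project's model: the layer sizes, the input dimensions (8 time steps, 5 temporal features, 3 static features), the input names, and the function `build_model` are all assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_timesteps, n_temporal, n_static):
    """Two-branch network: bidirectional LSTM (temporal) + MLP (static)."""
    temporal_in = keras.Input(shape=(n_timesteps, n_temporal), name="temporal")
    static_in = keras.Input(shape=(n_static,), name="static")

    # Temporal branch: reads the sequence forwards and backwards.
    t = layers.Bidirectional(layers.LSTM(16))(temporal_in)
    # Static branch: a small multi-layer perceptron.
    s = layers.Dense(16, activation="relu")(static_in)

    # Concatenate both branches, add a hidden layer, then a single sigmoid output.
    merged = layers.concatenate([t, s])
    hidden = layers.Dense(8, activation="relu")(merged)
    output = layers.Dense(1, activation="sigmoid", name="outcome")(hidden)

    model = keras.Model(inputs=[temporal_in, static_in], outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_model(n_timesteps=8, n_temporal=5, n_static=3)

# Random placeholder inputs for four hypothetical participants.
rng = np.random.default_rng(0)
temporal = rng.normal(size=(4, 8, 5)).astype("float32")
static = rng.normal(size=(4, 3)).astype("float32")
preds = model.predict([temporal, static], verbose=0)  # one probability per participant
```

The sigmoid output gives a probability that can be thresholded (for example at 0.5) to obtain the binary 1/0 prediction.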
37. AI Testing and Validating
• AI uses training and
validation data to classify the
object as a Chihuahua or
muffin
• Similar approaches are used
to classify an image as either
benign or malignant tumor
• Entire AI process must be
tested and validated, from
data preprocessing to the
model outputs
38. AI Model Output: Accuracy
• Training and validation data
fed into the AI model will
produce accuracy results of
the training and validation
data
• Accuracy for training and
validation of the AI model
39. AI Model Output: Loss
• Binary cross-entropy
• Loss for training and validation
of the AI model
• The graph shows the model
loss for the training and
validation data
• If the validation loss
continually increases after a
specific epoch, this could be
due to overfitting
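For reference, binary cross-entropy can be written out in a few lines of NumPy. The function name and the clipping constant `eps` are choices made for this sketch; deep learning libraries provide their own implementations.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(losses.mean())

confident = binary_cross_entropy([1, 0], [0.9, 0.1])   # confident and correct
uncertain = binary_cross_entropy([1, 0], [0.6, 0.4])   # correct but less confident
```

Lower values mean the predicted probabilities sit closer to the true labels, so the confident predictions score a smaller loss than the uncertain ones.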
40. AI Model Output: Precision
• What proportion of
positive identifications is
actually correct?
• Precision = TP / (TP + FP)
• 0.78125 = 25 / (25+7)
41. AI Model Output: Recall
• Model Recall
“Sensitivity”
• What proportion of
actual positives are
identified correctly?
• Recall = TP / (TP + FN)
• 0.6098 = 25 / (25 + 16)
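The precision and recall arithmetic from the slides can be checked with a short script. The counts TP = 25, FP = 7, and FN = 16 come from the slides; TN = 35 is inferred here by assuming all 83 test records reported elsewhere in the talk fall into one of the four cells, so treat it as an assumption.

```python
def precision(tp, fp):
    """Share of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that are identified (sensitivity)."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

TP, FP, FN = 25, 7, 16
TN = 83 - TP - FP - FN  # inferred under the 83-record assumption

p = precision(TP, FP)
r = recall(TP, FN)
a = accuracy(TP, TN, FP, FN)
```

Under that assumption the accuracy works out to 60 / 83, which agrees with the accuracy reported for runs that share this precision and recall.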
42. AI Model Output: AUC - ROC
• An AUC closer to one
provides a good measure
of separability
• The ROC curve shows the
trade-off between
sensitivity (or TPR) and
specificity (1 – FPR)
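One way to read the AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition; the function name and the toy scores are illustrative, and library routines such as scikit-learn's `roc_auc_score` give the same quantity.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

auc = auc_by_ranking([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this toy case three of the four positive/negative pairs are ordered correctly, so the separability is good but not perfect.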
43. AI Model Output: Iterative Testing Metrics
• Building AI models is
an iterative process.
• To validate the
model, the data are
repeatedly re-sampled at random
and evaluated by
the model.
• The results of this
model still require
further QA and
validation testing.
Test Run 8 Output:
Test Run               Accuracy         Loss             Precision        Recall           AUC              F-Score
1                      0.7229           0.6151           0.7812           0.6098           0.7113           0.7192
2                      0.7229           0.6069           0.7813           0.6098           0.734            0.7192
3                      0.6987           0.6052           0.7222           0.6341           0.7288           0.6975
4                      0.7469           0.6152           0.7632           0.7073           0.7465           0.7465
5                      0.7469           0.5963           0.7941           0.6586           0.7387           0.7445
6                      0.7108           0.6563           0.7179           0.6829           0.7305           0.7106
7                      0.7349           0.632            0.7209           0.7561           0.7445           0.7345
8                      0.7229           0.6169           0.7813           0.6098           0.752            0.7192
Mean ± SD (runs 1-5)   0.72766 ± 0.02   0.60774 ± 0.01   0.7684 ± 0.03    0.64392 ± 0.04   0.73186 ± 0.01   0.72538 ± 0.02
44. AI Model Output: Important Temporal Features
• Top temporal features, ranked
by mean absolute SHAP value.
• The color bar corresponds to
raw values. If raw value is high,
it appears red. Low raw value,
appears blue.
• Each variable appears as its own
point.
• The distribution shows how a
variable may influence the
model.
• Features extending to right, may
influence a (1) prediction.
• Features extending to left, may
influence a (0) prediction.
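Ranking features by mean absolute SHAP value can be sketched with NumPy once the SHAP values have been computed. The array of SHAP values and the feature names below are made up for illustration; in practice they would come from a SHAP explainer applied to the trained model.

```python
import numpy as np

def rank_by_mean_abs_shap(shap_values, feature_names):
    """Return (feature, importance) pairs sorted from most to least important.

    The last axis of shap_values must index features; any leading axes
    (samples, and time steps for temporal inputs) are averaged over.
    """
    values = np.asarray(shap_values, dtype=float)
    importance = np.abs(values).reshape(-1, values.shape[-1]).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [(feature_names[i], float(importance[i])) for i in order]

# Hypothetical SHAP values: two samples, three features.
toy_shap = np.array([[0.1, -0.5, 0.0],
                     [-0.3, 0.4, 0.05]])
ranking = rank_by_mean_abs_shap(toy_shap, ["feature_a", "feature_b", "feature_c"])
```

Taking the absolute value first means a feature that pushes strongly toward either a (1) or a (0) prediction ranks highly, regardless of direction.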
45. AI Model Output: Important Static Features
• Top Static Features
• The distribution shows how
a variable may influence the
model.
• Features extending to right,
may influence a (1)
prediction.
• Features extending to left,
may influence a (0)
prediction.
• Blue = low raw value
• Red = high raw value
46. Project Limitations
• This project analyzes high-dimensional clinical research data; not all data were available
at the time of this research.
• The number of T1D cases available from the TEDDY study has nearly doubled, and TEDDY is in the process
of developing a second case-control study. The additional data could improve the
statistical power and accuracy of this project.
• T1D is an endpoint in the TEDDY study: study participants who are diagnosed with
T1D are no longer followed after their study endpoint visit. This limits the amount of data
available.
• AI model output and important features must be validated and reviewed by TEDDY
researchers.
• Results may be difficult to generalize to the general T1D population.
• Some data required censoring, imputation, and exclusion.
• Right censoring is a concern, as ordinary linear regression is often ineffective at handling
censored observations.
47. Current Project Outcomes
• We successfully developed an AI framework that incorporates -omics and
environmental data to predict outcomes.
• Developed packages to prepare data for the AI model and transform 2D
data into 3D data that can be fed into the AI model.
• The AI model is capable of combining temporal (time-series) and static
data.
• Combines (concatenates) AI neural networks, LSTM and MLP.
• Currently predicting IA± on 83 test records with a mean accuracy of 0.72
• While the AI model still needs to be validated internally by the project
team and TEDDY researchers, a combination of –omics data and static data
from the TEDDY study can be run through the AI model. The project is still
in development and being tested.
48. Next Steps
• Enhance current AI models.
• Tune hyperparameters, perform quality assurance (QA), and validate and test models.
• The current model is being developed to predict IA±, but additional data
are needed to feed the model and predict T1D as an outcome.
• Acquire and prepare data for additional Neural Network (NN) models.
• Program, validate, and test additional NN models for additional data.
• Quality control of AI framework and model, internal/construct validity.
• Generalize AI framework to incorporate additional methods and models for
predictions of various diseases.
• Collaborate with TEDDY researchers to validate AI model outcomes and
important features.
49. Acknowledgements
• Health Informatics Institute at the University of South
Florida (USF)
• Dr. Jeffrey Krischer
• Dr. Kristian Lynch
• Leo Moreno
• Chris Shaffer
• TEDDY Publications Committee
• TEDDY Study Group:
• NIH, TEDDY DCC, clinical center investigators, clinical center
staff, and TEDDY study participants.
• dkNET / NIDDK