What is health data science?
• Data-driven approaches to solving complex real-world health problems
• Or deriving knowledge from unstructured and messy data
• It is an interdisciplinary field: biostatistics, computer science,
epidemiology, public health, mathematics, etc.
Real-life health data science examples
• HIV:
• Visualising the pattern of early HIV transmission within the mucosal barrier
• COVID-19:
• What can predict COVID-19 neutralisation activity?
• Can we predict COVID-19 vaccine efficacy?
Background
• Early HIV transmission events might occur during vaginal or anal sex
• We want to investigate whether the mucosal barrier (within the vaginal
tissue) is effective in blocking HIV transmission
If the mucosal barrier is good at preventing viral
transmission, this is what we expect to see
If the mucosal barrier is not good at preventing
transmission, multiple viruses can be found
(random infection)
If the mucosal barrier is not good at preventing
transmission, multiple viruses can be found
(clustered infection)
Data Visualisation
Many viral variants can still be seen:
no evidence that the vaginal tissue
is effective in blocking viral entry
Need a formal method
• How can we say (formally) whether infection is spatially clustered or not?
• Mantel test (or Mantel and Valand) -> relate a matrix of
“geographical” distance and a matrix of “biological” distance
• So, need to define the “geographical” matrix and “biological” matrix
first
Mantel Test (or Mantel and Valand)
• Testing the association between two matrices
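• Mantel quantity (Zm) is given by:
$$Z_m = \sum_{i} \sum_{j} g_{ij} \, b_{ij}$$
where g_ij and b_ij are the entries of the “geographical” and “biological” distance matrices
• Basic idea -> permutation test
• Randomly permute the rows and columns of one of the two matrices (using the same reordering for rows and columns)
• And store the value of Zm for each permutation, to compare against the observed Zm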
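The permutation idea above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the function names `mantel_z` and `mantel_test`, the number of permutations, the one-sided (upper-tail) p-value and the toy matrices are all assumptions made for the example.

```python
import random

def mantel_z(g, b):
    """Mantel quantity: sum over all i, j of g[i][j] * b[i][j]."""
    n = len(g)
    return sum(g[i][j] * b[i][j] for i in range(n) for j in range(n))

def mantel_test(g, b, n_perm=999, seed=0):
    """Permutation test for association between two square distance matrices.

    Rows and columns of b are reordered with the same random permutation,
    and Zm is recomputed each time. The p-value is one-sided (upper tail),
    which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    n = len(g)
    observed = mantel_z(g, b)
    idx = list(range(n))
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(idx)
        permuted = [[b[idx[i]][idx[j]] for j in range(n)] for i in range(n)]
        if mantel_z(g, permuted) >= observed:
            at_least_as_large += 1
    # Add one to numerator and denominator so the observed value counts too
    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return observed, p_value

# Toy example (made-up data): 4 samples on a line, and a "biological"
# distance that is identical to the "geographical" distance.
g = [[abs(i - j) for j in range(4)] for i in range(4)]
b = [[abs(i - j) for j in range(4)] for i in range(4)]
z_obs, p = mantel_test(g, b)
```

A small p-value would suggest that samples that are close in space also tend to be close biologically, i.e. spatial clustering.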
Background
• Neutralising antibody (NAb): an antibody that can defend the host against
a specific pathogen
• Data: 41 convalescent adults; several immunological parameters were
measured (13 parameters in total)
• Goal: find out which immunological parameters can be used to predict
NAb in those 41 recovered patients
Methods
• Data visualisation is very important in data science
• First step: plot the correlation matrix for the whole dataset
OK, not very informative...
So many things are correlated with microneutralization
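As a minimal sketch of the first step, the pairwise Pearson correlations behind such a plot could be computed as below. The column names and values are made up for illustration and are not the study's data; `pearson` and `correlation_matrix` are hypothetical helper names.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up toy values, purely illustrative
data = {
    "microneut": [1.0, 2.0, 3.0, 4.0],
    "param_a": [2.0, 4.0, 6.0, 8.0],   # rises with microneut
    "param_b": [4.0, 3.0, 2.0, 1.0],   # falls as microneut rises
}
corr = correlation_matrix(data)
```

With 13 parameters this gives a 13 x 13 matrix, which is why the plot alone makes it hard to single out one predictor.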
Methods
• Correlation matrix shows that NAb is correlated with so many things
• Next step: Can I find some hidden features in this dataset?
• Method: principal component analysis (PCA)
The main focus is microneutralization
If the angle between microneut and another variable is less
than 90°, then it’s a positive association
If the angle between microneut and another variable is greater
than 90°, then it’s a negative association
For instance, higher ELISA S trimer gives higher
microneutralization level (less than 90°)
For instance, higher CCR6+CXCR3- gives lower
microneutralization level (more than 90°)
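The angle rule can be illustrated with a short sketch that computes the angle between two arrows in a PCA plot and reads off the sign of the association. The arrow coordinates below are hypothetical (PC1, PC2) values chosen for the example, not loadings from the study, and `angle_deg` and `association` are illustrative helper names.

```python
import math

def angle_deg(u, v):
    """Angle in degrees between two 2D vectors (arrows in the PCA plot)."""
    dot = u[0] * v[0] + u[1] * v[1]
    norm = math.hypot(*u) * math.hypot(*v)
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))

def association(u, v):
    """Sign of the association implied by the angle rule."""
    a = angle_deg(u, v)
    if a < 90:
        return "positive"
    if a > 90:
        return "negative"
    return "none"

# Hypothetical arrow coordinates, for illustration only
microneut = (1.0, 0.0)
var_close = (1.0, 1.0)      # 45 degrees from microneut
var_opposite = (-1.0, 1.0)  # 135 degrees from microneut

close_sign = association(microneut, var_close)
opposite_sign = association(microneut, var_opposite)
```

The rule works because the cosine of the angle is positive below 90° and negative above it, and in this kind of plot that cosine approximates the correlation between the two variables.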
Methods
• PCA visualisation is better than the correlation matrix
• But we still cannot just pick one thing that can be used to predict NAb
• Next step: I want to pick only one thing to predict NAb
• Method: multiple linear regression with a backward model selection
strategy
• The idea is to run a linear regression with all the variables, and iteratively
remove the least significant predictor until all remaining predictors are significant
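The backward selection loop can be sketched as follows, assuming NumPy is available. This is an illustration under stated assumptions, not the analysis used in the study: p-values come from a normal approximation (a t-distribution would be more appropriate for a sample as small as 41 patients), the 0.05 cut-off is assumed, and the data are simulated.

```python
import numpy as np
from statistics import NormalDist

def ols_pvalues(X, y):
    """Least-squares fit of y on an intercept plus the columns of X.

    Returns approximate two-sided p-values for each column of X
    (normal approximation; an assumption of this sketch).
    """
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - k - 1)
    cov = sigma2 * np.linalg.inv(A.T @ A)
    z = beta / np.sqrt(np.diag(cov))
    p = [2 * (1 - NormalDist().cdf(abs(float(t)))) for t in z]
    return p[1:]  # drop the intercept

def backward_select(X, y, names, alpha=0.05):
    """Drop the least significant predictor until all have p <= alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        p = ols_pvalues(X[:, keep], y)
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break
        keep.pop(worst)
    return [names[i] for i in keep]

# Simulated data: only x0 truly drives y
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)
selected = backward_select(X, y, ["x0", "x1", "x2"])
```

In the simulated example the true predictor `x0` is always retained; whether the noise columns are dropped depends on the random draw.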
Background
• At the end of the phase 2 trial, we get the immunogenicity data
(measuring the amount of antibody)
• Given the data from phase 2 trial (antibody data), can we predict
what the efficacy of the vaccine will be?
• Training dataset: efficacy and antibody data from all available vaccines
Methods
• The first step is always to visualise your data, so why don’t we plot
efficacy against antibody first?
High antibody = high efficacy
Low antibody = low efficacy
Can we simply do a classification method based on the
level of antibody?
Methods
• The model is a distribution-free binary classification model, based on
the threshold level of antibody
• The lower your antibody level, the higher your chance of being infected,
so the vaccine efficacy will be lower
• The higher your antibody level, the lower your chance of being infected,
so the vaccine efficacy will be higher
• We want to know what this antibody threshold is
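One way to read the threshold idea is sketched below. It assumes that predicted efficacy is the fraction of vaccinated individuals whose antibody level is at or above a protective threshold, and that the threshold is chosen to best match observed efficacy across the training vaccines. This is an interpretation for illustration, not necessarily the model used in the study; the function names and all numbers are made up.

```python
def predicted_efficacy(levels, threshold):
    """Fraction of individuals with antibody level at or above the threshold."""
    protected = sum(1 for x in levels if x >= threshold)
    return protected / len(levels)

def fit_threshold(training, candidates):
    """Pick the candidate threshold with the smallest squared error.

    training: list of (antibody_levels, observed_efficacy) pairs,
    one per vaccine in the training set.
    """
    def sse(t):
        return sum((predicted_efficacy(lv, t) - eff) ** 2 for lv, eff in training)
    return min(candidates, key=sse)

# Made-up toy data: antibody levels relative to the convalescent mean (= 1)
training = [
    ([0.1, 0.3, 0.6, 0.9], 0.50),  # hypothetical vaccine A
    ([0.4, 0.8, 1.5, 2.0], 0.75),  # hypothetical vaccine B
]
candidates = [0.2, 0.5, 1.0]
best = fit_threshold(training, candidates)
```

Because the rule only counts how many individuals fall above the threshold, it makes no assumption about the shape of the antibody distribution, which matches the "distribution-free" description.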
We normalised the antibody to the convalescent patients
(the mean for convalescent is one)
Covaxin data came out a bit later, so we used Covaxin to
validate our ‘classifier’ model
Using our classifier, as long as we have antibody data (from a
phase 2 trial), we can predict the efficacy of any vaccine