This document discusses Roberta Balcytyte's research using machine learning for early detection of rare hereditary diseases from large, severely imbalanced datasets. Specifically, the research aims to develop models to detect Hereditary Angioedema (HAE) using data on about 1,200 HAE cases and 165 million controls. Initial results found that Random Forest and AdaBoost classifiers performed best, with average AUCs of 88.9% and 89.0% respectively. The research seeks to supplement medical diagnosis by making rare disease detection faster and more accurate through machine learning.
2. ROBERTA BALCYTYTE
MACHINE LEARNING FOR EARLY DETECTION OF
HEREDITARY RARE DISEASE.
Prediction of rare disease from
big and severely imbalanced data.
Rare diseases affect a relatively small
number of people (1 per 2,000)
compared to the general population
and specific issues are raised in
relation to their rarity. In particular,
many patients are never diagnosed,
or the diagnosis is delayed or wrong,
resulting in inappropriate
treatment. This research aims to
investigate machine learning based
models which can capture early
flags of rare disease, namely HAE,
and supplement the medical diagnosis
procedure by making it faster and
more accurate. The rarity of the
condition poses the under-investigated
computational challenge of severe
class imbalance within big data
(1,200 cases vs. 165M controls).
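To give a sense of how severe this imbalance is, the short calculation below uses only the two counts quoted above. The "always negative" model is a hypothetical baseline added for illustration, not something from the project; it shows why plain accuracy is uninformative here.

```python
# Counts quoted in the text (both are approximate).
cases = 1_200
controls = 165_000_000

controls_per_case = controls / cases
prevalence = cases / (cases + controls)

# Hypothetical baseline: a model that labels every patient as a control.
# It is right on every control and wrong on every case.
trivial_accuracy = controls / (cases + controls)
trivial_sensitivity = 0 / cases

print(f"controls per case: {controls_per_case:,.0f}")
print(f"prevalence in the data: {prevalence:.6%}")
print(f"accuracy of the always-negative model: {trivial_accuracy:.4%}")
print(f"sensitivity of the always-negative model: {trivial_sensitivity:.0%}")
```

With roughly 137,500 controls per case, a model can be almost perfectly "accurate" while detecting no cases at all, which is why the results are reported as AUC, sensitivity and specificity.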
Task:
For the experiments we used a
cleaned data set which contained
165M US patients with ~1,200 HAE
positive cases and ~240 features
on relevant medical events.
Firstly, we experimented with six
classifiers to fit a prediction model:
L2 regularized logistic regression,
SVM with Gaussian kernel, decision
tree, random forest, AdaBoost and
Gaussian Naïve Bayes. All classifiers
were tuned within cross-validation and
trained on randomly under-sampled
controls for 50 iterations.
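The repeated random under-sampling described above could be sketched as follows. This is a minimal illustration using only the standard library, not the project's code: the function name, the 1:1 case-to-control ratio, the seed and the toy patient IDs are all assumptions, and the classifier fitting and cross-validated tuning are left out.

```python
import random

def undersampled_training_sets(case_ids, control_ids, n_iterations=50,
                               controls_per_case=1, seed=0):
    """Yield one training set per iteration: all cases plus a fresh random
    subset of controls. The ratio and the seed are illustrative assumptions."""
    rng = random.Random(seed)
    n_controls = min(len(control_ids), controls_per_case * len(case_ids))
    for _ in range(n_iterations):
        sampled_controls = rng.sample(control_ids, n_controls)
        yield list(case_ids), sampled_controls

# Toy data: 12 cases and 10,000 controls stand in for the real cohort.
case_ids = list(range(12))
control_ids = list(range(12, 10_012))

training_sets = list(undersampled_training_sets(case_ids, control_ids))
print(len(training_sets), "training sets,",
      len(training_sets[0][1]), "sampled controls in each")
```

Each iteration keeps every case and draws a different subset of controls, so a classifier's score can be averaged over the iterations rather than depending on one lucky or unlucky sample.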
Secondly, we applied an advanced
technique for under-sampling the
majority class from a big data set.
The technique is based on Tomek-links
and is parallelizable.
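For readers unfamiliar with Tomek links: a Tomek link is a pair of points from different classes that are each other's nearest neighbour, and under-sampling removes the majority-class member of each pair. The sketch below is a brute-force illustration of that general idea on toy 2-D points; the function names and data are assumptions, and it is not the parallel variant used in the project.

```python
import math

def nearest_neighbour(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))

def tomek_link_undersample(points, labels, majority_label=0):
    """Return indices to keep after dropping every majority-class point that
    is part of a Tomek link (mutual nearest neighbours, different labels)."""
    nn = [nearest_neighbour(i, points) for i in range(len(points))]
    to_drop = set()
    for i, j in enumerate(nn):
        if nn[j] == i and labels[i] != labels[j]:
            to_drop.add(i if labels[i] == majority_label else j)
    return [i for i in range(len(points)) if i not in to_drop]

# Toy data: label 1 = case (minority), label 0 = control (majority).
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.2, 1.0), (5.0, 5.0), (5.1, 5.0)]
labels = [0, 1, 0, 0, 1, 0]

kept = tomek_link_undersample(points, labels)
print(kept)  # [1, 2, 3, 4]: controls 0 and 5 sat right next to a case
```

The nearest-neighbour search for each point is independent of the others, which is one reason a method of this kind lends itself to parallel computation.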
The research was conducted in
collaboration with industry partner
‘IMS Health’ and led by UCL Prof.
Delmiro Fernandez-Reyes.
Review:
The project is still in progress but
in the first stage we have found
that Random Forest and AdaBoost
outperformed other classifiers with an
average AUC of 88.9% (std. 0.87%)
and 89.0% (std. 0.81%) respectively.
Furthermore, AdaBoost achieved a
higher sensitivity (63.97%) than
Random Forest while sustaining a
relatively high specificity of 93.17%
(Fig. 2).
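For reference, the three metrics quoted above can be computed as in the sketch below. The labels, scores and 0.5 threshold are toy values chosen for illustration, not project data. AUC is computed here as the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as half.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = share of cases detected; specificity = share of
    controls correctly left unflagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """Area under the ROC curve via pairwise case/control comparisons."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 3 cases (label 1) and 5 controls (label 0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # 0.5 threshold is an assumption

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUC={auc(y_true, scores):.2f}")
```

AUC summarises ranking quality across all thresholds, whereas sensitivity and specificity describe one chosen threshold, so two classifiers with nearly identical AUC can still differ noticeably in sensitivity.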
During the second stage the advanced
under-sampling technique improved
the predictive power of the
classifier, but only slightly.
To our knowledge this research is the
first attempt to apply machine learning
to predicting HAE and is one of the
few studies focusing on rare disease
prediction in general from big and
severely imbalanced data.
3. What makes this project unique?
This project is the first in its
field to apply machine learning
to HAE prediction.
Our aim is to build a predictive
model for a very rare disease from
big and severely imbalanced data.
It will make the diagnosis of such a
disease faster and more accurate.
What are your
plans for the future?
I have been learning and
implementing new techniques
of under-sampling and parallel
computing. After the project is finished
I would like a data analytics job
within healthcare. The project has
definitely motivated me to continue
working within that industry.
Did you always know this was
the area you wanted to work in?
Originally my background was
in Economics. After graduating
I joined EY for 5 years, doing
projects related to process
analysis, improvement, risk
assessment and organizational
performance monitoring - not
very related to data analytics
or programming!
I decided to convert to data
analytics after my secondment year
in the Enterprise Intelligence & Data
Analytics Centre of Excellence.
It inspired me to pursue new
challenges and become a data
scientist. So I applied to UCL to
do an MSc in Business Analytics.
What has been the
highlight so far?
I started the program from scratch,
not knowing anything before. I am
proud of my endeavor to become
a data scientist. It’s been a very
steep learning curve!
What advice would you
give your 18-year-old self?
Always keep learning
and exploring.
What is changing in engineering?
This research project has
convinced me that revised, novel
algorithms and new tools are
needed to leverage the treasures
of big data.
What excites you about the
opportunities with data today
and in the future?
I believe that data science will
start evolving in the healthcare
industry more and more rapidly and
very soon. This will lead to better
medical services, saved lives and
an improved quality of life for many,
many people.
I am very excited to be part of
this, developing innovative, more
efficient ways of curing people or
even better, preventing diseases.
Q&A
WE SAT DOWN WITH ROBERTA AND ASKED HER A FEW
QUESTIONS ABOUT HER PROJECT AND WHAT SHE
THINKS THE FUTURE HOLDS FOR HER.