1. Big Data and its Role in
Biomedical Research
Philip E. Bourne PhD, FACMI
Stephenson Chair of Data Science
Director, Data Science Institute
Professor of Biomedical Engineering
peb6a@virginia.edu
https://www.slideshare.net/pebourne
10/10/18 ACoP 2018 1
@pebourne
2. Bias
• Can't help but be influenced by my time as Associate
Director for Data Science (ADDS) at NIH
• Now very much engaged in data science across disciplines
– broader but shallower perspective
• Knowing my long-time colleague Prof. Lei Xie and others
will follow me with a deeper perspective
4. Big data and data
science are like
the Internet…
If I asked you to
define them you
would all say
something
different, yet you
use them every
day…
http://vadlo.com/cartoons.php?id=357
5. So what do I mean by big data/data
science?
• Use of the ever-increasing amount of open, complex, diverse
digital data
• Finding ways to ask and then answer relevant questions by
combining such diverse data sets
• Arriving at statistically significant conclusions not otherwise
obtainable
• Sharing such findings in a useful way
• Translating such findings into actions that improve the human
condition
7. Machine learning has been around for over 20
years – why the fuss now?
• Amount of data available for training
• Open source: R and Python
• Advances in computing (e.g., GPUs) allow for deeper neural nets (deep
learning)
• Algorithmic efficiency gains (e.g., in backpropagation)
• Success promotes further research
• Commercialization
Pastur-Romay et al. 2016 doi:10.3390/ijms17081313
8. The NIH view
• Big Data
– Total data from NIH-funded research in 2016 estimated at 650 PB*
– 20 PB of that is in NCBI/NLM (3%) and it is expected to grow by 10
PB in 2016
• Dark Data
– Only 12% of data described in published papers is in recognized
archives – 88% is dark data^
• Cost
– 2007-2014: NIH spent ~$1.2Bn extramurally on maintaining data
archives
* In 2012 Library of Congress was 3 PB
^ http://www.ncbi.nlm.nih.gov/pubmed/26207759
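The proportions on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted above; the variable names are illustrative and not from the talk.

```python
# Figures quoted on the slide, in petabytes (PB) or as fractions.
TOTAL_NIH_PB = 650            # estimated NIH-funded research data, 2016
NCBI_NLM_PB = 20              # portion held in NCBI/NLM
LIBRARY_OF_CONGRESS_PB = 3    # Library of Congress in 2012 (slide footnote)
ARCHIVED_FRACTION = 0.12      # data in recognized archives

ncbi_share = NCBI_NLM_PB / TOTAL_NIH_PB
dark_fraction = 1 - ARCHIVED_FRACTION
loc_multiples = TOTAL_NIH_PB / LIBRARY_OF_CONGRESS_PB

print(f"NCBI/NLM share of total: {ncbi_share:.1%}")         # 3.1%
print(f"Dark data fraction: {dark_fraction:.0%}")           # 88%
print(f"Multiples of the 2012 Library of Congress: {loc_multiples:.0f}")  # 217
```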
9. NIH strategic plan for data
• Support a Highly Efficient and Effective
Biomedical Research Data
Infrastructure
• Promote Modernization of the Data-
Resources Ecosystem
• Support the Development and
Dissemination of Advanced Data
Management, Analytics, and
Visualization Tools
• Enhance Workforce Development for
Biomedical Data Science
• Enact Appropriate Policies to Promote
Stewardship and Sustainability
https://grants.nih.gov/grants/rfi/NIH-Strategic-Plan-for-Data-Science.pdf
10. A research data infrastructure requires
we move from pipes to platform…
which begs the question ...
Will biomedical research become more like Airbnb?
Vivien Bonazzi
Bonazzi & Bourne 2017, PLoS Biol. 15(4):e2001818.
11. I am not crazy, hear me out
• Airbnb is a platform that supports a trusted relationship between consumer
(renter) and supplier (host)
• The platform focuses on maximizing the exchange of services between supplier and
consumer and maximizing the amount of trust associated with a given stakeholder
• It seems to be working:
– 60 million users searching 2 million listings in 192 countries
– Average of 500,000 stays per night.
– Valuation of US $25bn
Bonazzi & Bourne 2017, PLoS Biol. 15(4):e2001818.
13. The pillars of data science operate
within this platform environment
QSP
14. Let's briefly focus on those five pillars
in the context of QSP …
15. Data acquisition
The data production issue (the V’s of Big Data)— Experimentally
• Estimated (2017) that ≈2.5 quintillion (2.5×10^18) bytes of data are generated daily, with 90%
of all the world’s data having been created in the past two years.
• Plaintext PDB files are typically a few hundred KB (…but that’s just the start!)
Mura et al. 2018 Curr Opin Struct Biol. 52:95-102
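To put the daily figure in more familiar units, here is a rough back-of-the-envelope sketch. The 300 KB file size is an assumption standing in for "a few hundred KB" and is not a number from the talk; decimal units (1 KB = 1000 bytes) are assumed throughout.

```python
BYTES_PER_DAY = 2.5e18              # estimate quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60
ASSUMED_PDB_FILE_BYTES = 300e3      # assumption: a "few hundred KB" taken as 300 KB

exabytes_per_day = BYTES_PER_DAY / 1e18
terabytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / 1e12
pdb_files_per_day = BYTES_PER_DAY / ASSUMED_PDB_FILE_BYTES

print(f"{exabytes_per_day:.1f} EB per day")
print(f"{terabytes_per_second:.1f} TB per second")   # about 28.9
print(f"{pdb_files_per_day:.2e} PDB-sized files per day (under the 300 KB assumption)")
```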
16. Data integration and engineering
• Generic
– Ontologies
– Object identifiers
– Indexing schemes
– Common data models
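The four generic ingredients listed above can be illustrated with a toy sketch. Everything in it is hypothetical: the `Compound` model, the `cmpd:` identifier scheme, the `TARGET:0001` ontology term, and the two source record layouts are invented for illustration and do not correspond to any real resource.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Compound:
    """Hypothetical common data model shared by all sources."""
    identifier: str    # persistent object identifier (made-up scheme)
    name: str
    target_term: str   # controlled ontology term for the target (made-up ID)

# Toy ontology: maps free-text target labels onto one controlled term.
TARGET_ONTOLOGY = {
    "egfr": "TARGET:0001",
    "epidermal growth factor receptor": "TARGET:0001",
}

def from_source_a(rec: dict) -> Compound:
    """Map source A's field names onto the common model."""
    return Compound(f"cmpd:{rec['id']}", rec["drug_name"],
                    TARGET_ONTOLOGY[rec["target"].lower()])

def from_source_b(rec: dict) -> Compound:
    """Map source B's differently named fields onto the same model."""
    return Compound(f"cmpd:{rec['compound_id']}", rec["label"],
                    TARGET_ONTOLOGY[rec["protein"].lower()])

def build_index(records: list) -> dict:
    """Simple indexing scheme: group records by ontology term."""
    index: dict = {}
    for record in records:
        index.setdefault(record.target_term, []).append(record)
    return index

records = [
    from_source_a({"id": "A1", "drug_name": "Example-1", "target": "EGFR"}),
    from_source_b({"compound_id": "B7", "label": "Example-2",
                   "protein": "Epidermal growth factor receptor"}),
]
index = build_index(records)
# Both records resolve to the same term despite different source vocabularies.
print([r.identifier for r in index["TARGET:0001"]])
```

The point of the design is that integration logic lives in the per-source mappers, so downstream queries only ever see the common model.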
19. Ethics, law & policy
• Landmark studies identify
histone mutations as
recurrent driver mutations in
DIPG ~2012
• Almost 3 years later, in
largely the same datasets,
but partially expanded, the
same two groups and 2
others identify ACVR1
mutations as a secondary,
co-occurring mutation
From Adam Resnick
Diffuse Intrinsic Pontine Glioma (DIPG)
20. Conclusion:
Driven by large amounts of open
digital data of different types and new
algorithms and approaches, biomedical
researchers are destined to follow the
private sector towards the fourth
paradigm
21. Acknowledgements
The BD2K Team at NIH
My Colleagues at UVA
The 150 folks who have passed through my laboratory
https://docs.google.com/spreadsheets/d/1QZ48UaKcwDl_iFCvBmJsT03FK-bMchdfuIHe9Oxc-rw/edit#gid=0
Zheng Zhao, Lei Xie
Model integration in systems pharmacology. Diverse models need to be integrated
across multiple methodologies, multiple heterogeneous data sets, organismal hierarchy, and
species (transportability).
$1.25bn per year to capture all data.
After a significant effort at reduction, intramural data is still spread across more than 60 data centers; imagine the extramural situation.
Distribution of kinases and the number of covalent small-molecule kinase inhibitors (CSKIs) for every targeted kinase across the human kinome