Big Data and Data Science: Opportunities for Biomedical Engineering
1. Big Data and Data Science:
Opportunities for Biomedical
Engineering
Philip E. Bourne PhD, FACMI
Stephenson Chair of Data Science
Director, Data Science Institute
Professor of Biomedical Engineering
peb6a@virginia.edu
https://www.slideshare.net/pebourne
04/08/18 AIMBE Academic Council 1
@pebourne
2. Disclaimer
• This is mostly NOT a talk about my own
research
• It draws upon my now one-year-old view of
NIH as the former Associate Director for Data
Science (ADDS)
• It suffers from my drinking my own Kool-Aid at
the University of Virginia
3. Take home (hopefully)
• Increased awareness of the value of data
science to your activities
• Increased awareness of where NIH is headed
• Some thoughts about how to build out data
science in your own institutions
4. Big data and data science are like the
Internet…
If I asked you to define them you would all
say something different, yet you use them
every day…
http://vadlo.com/cartoons.php?id=357
5. So what do I mean by big data/data
science?
• Use of the ever increasing amount of open, complex,
diverse digital data
• Finding ways to ask and then answer relevant
questions by combining such diverse data sets
• Arriving at statistically significant conclusions not
otherwise obtainable
• Sharing such findings in a useful way
• Translating such findings into actions that improve
the human condition
6. Cause
• There are ~2.7 zettabytes (2.7 x 10^6 PB) of digital data
• Volume is doubling every two years
• Sheer volume of digital data e.g., $1000 genome,
wearable sensors, mandatory EHRs
• New tools e.g., Deep Artificial Neural Networks (DNNs)
• New computing power e.g., GPUs
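The scale figures on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes decimal (SI) units, takes the ~2.7 ZB total and the two-year doubling time at face value, and assumes the doubling stays steady. The function names are hypothetical and exist only for this example.

```python
# Illustrative arithmetic only; the input figures come from the slide.
BYTES_PER_PB = 10**15  # petabyte, SI decimal definition
BYTES_PER_ZB = 10**21  # zettabyte, SI decimal definition


def zb_to_pb(zettabytes: float) -> float:
    """Convert zettabytes to petabytes (1 ZB = 10^6 PB)."""
    return zettabytes * (BYTES_PER_ZB // BYTES_PER_PB)


def projected_volume_zb(start_zb: float, years: float,
                        doubling_years: float = 2.0) -> float:
    """Project volume forward, assuming steady exponential doubling."""
    return start_zb * 2 ** (years / doubling_years)


if __name__ == "__main__":
    print(f"2.7 ZB = {zb_to_pb(2.7):.2e} PB")
    for years in (2, 4, 10):
        volume = projected_volume_zb(2.7, years)
        print(f"after {years:>2} years: {volume:.1f} ZB")
```

Under these assumptions the total grows 32-fold in a decade, which is the point of the "Cause" slide: volume alone forces new tools and new computing power.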
7. Effect
• Big data currently estimated as a $50bn
business – could save $3.1tn
• 50% growth in data/yr; 5% growth in IT
expenditure
• US: 140,000-190,000 unfilled deep data
analytics jobs
• UVA DSI has 600 applicants this year for 50
spots; MSDS/MBA highly sought
8. Effect ++
• Big data currently estimated as a $50bn business
– could save $3.1tn – private sector research
• 50% growth in data/yr; 5% growth in IT
expenditure - undervalued
• US: 140,000-190,000 unfilled deep data analytics
jobs – competition for skilled researchers high
• DSI has 600 applicants this year for 50 spots;
MSDS/MBA highly sought – large human capital
9. How much biomedical data?
• Big Data
– Total data from NIH-funded research in 2016
estimated at 650 PB*
– 20 PB of that is in NCBI/NLM (3%) and it is
expected to grow by 10 PB in 2016
• Dark Data
– Only 12% of data described in published papers is
in recognized archives – 88% is dark data^
• Cost
– 2007-2014: NIH spent ~$1.2Bn extramurally on
maintaining data archives
* In 2012 Library of Congress was 3 PB
^ http://www.ncbi.nlm.nih.gov/pubmed/26207759
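The proportions quoted on this slide follow directly from the stated figures. A minimal sketch of that arithmetic, treating every number as given on the slide rather than independently verified (the constant and function names are hypothetical):

```python
# All inputs are the figures quoted on the slide.
NIH_TOTAL_PB = 650          # estimated data from NIH-funded research, 2016
NCBI_NLM_PB = 20            # portion held at NCBI/NLM
LIBRARY_OF_CONGRESS_PB = 3  # size in 2012, per the slide footnote
ARCHIVED_FRACTION = 0.12    # share of published data in recognized archives


def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# NCBI/NLM holdings as a share of the NIH-funded total (about 3%).
ncbi_share = percent(NCBI_NLM_PB, NIH_TOTAL_PB)

# Whatever is not in a recognized archive is "dark" (88%).
dark_share = 100.0 * (1 - ARCHIVED_FRACTION)

# The NIH-funded total expressed in 2012 Library of Congress units.
loc_equivalents = NIH_TOTAL_PB / LIBRARY_OF_CONGRESS_PB

if __name__ == "__main__":
    print(f"NCBI/NLM share of NIH-funded data: {ncbi_share:.1f}%")
    print(f"Dark data share: {dark_share:.0f}%")
    print(f"Library of Congress equivalents: {loc_equivalents:.0f}")
```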
10. Consider some current high profile NIH
examples where and how data science is
being applied
• Moonshot - platforms and
integration, ML
• MODs – automated curation
• Human Microbiome Project –
new cloud based tools, ML
• TOPMed - platforms and
integration
• All-of-Us - platforms and
integration
• ECHO – platforms and
integration
• BRAIN - ML
All: Analytics, the Commons, FAIR, sustainability, workforce
11. What of the future?
One view is the 6D’s
13. A call for making these data open
• Mandates
– NIH, NSF, Data
Management Plans
• Business models can be
protected yet everyone
benefits
• It saves lives ….
14. Why a More Open Process?
Use case:
Diffuse Intrinsic Pontine Gliomas (DIPG)
• Occur in 1:100,000
individuals
• Peak incidence 6-8 years
of age
• Median survival 9-12
months
• Surgery is not an option
• Chemotherapy ineffective
and radiotherapy only
transitory
From Adam Resnick
15. Timeline of genomic studies in DIPG
• Landmark studies identify
histone mutations as
recurrent driver mutations in
DIPG ~2012
• Almost 3 years later, in
largely the same datasets,
but partially expanded, the
same two groups and 2
others identify ACVR1
mutations as a secondary, co-
occurring mutation
From Adam Resnick
16. What do we need to do differently to
reveal ACVR1?
• ACVR1 is a targetable kinase
• Inhibition of ACVR1 inhibited tumor
progression in vitro
• ~300 DIPG patients a year
• ~60 are predicted to have ACVR1
• If only large-scale data sets had been
integrated with TCGA and/or rare
disease data in 2012, ACVR1 mutations
would have been identified
• 60 patients/year X 3 years = 180
children’s lives (who likely succumbed to
the disease during that time) could have
been impacted if only data were FAIR
From Adam Resnick
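The estimate on this slide is simple multiplication, spelled out below. This is a sketch of the slide's own arithmetic, using its approximate figures; the three-year delay is the gap between the ~2012 histone studies and the later ACVR1 reports, and the function names are hypothetical.

```python
# Approximate figures as quoted on the slide.
DIPG_PATIENTS_PER_YEAR = 300   # new DIPG patients per year
ACVR1_PATIENTS_PER_YEAR = 60   # of those, predicted to carry ACVR1 mutations
DELAY_YEARS = 3                # years between the two sets of findings


def acvr1_fraction(acvr1_per_year: int, total_per_year: int) -> float:
    """Fraction of DIPG patients predicted to carry an ACVR1 mutation."""
    return acvr1_per_year / total_per_year


def patients_during_delay(per_year: int, years: int) -> int:
    """Patients diagnosed during the delay, assuming a constant yearly rate."""
    return per_year * years


if __name__ == "__main__":
    share = acvr1_fraction(ACVR1_PATIENTS_PER_YEAR, DIPG_PATIENTS_PER_YEAR)
    affected = patients_during_delay(ACVR1_PATIENTS_PER_YEAR, DELAY_YEARS)
    print(f"ACVR1 share of DIPG patients: {share:.0%}")
    print(f"Patients diagnosed during the delay: {affected}")
```

The constant yearly rate is an assumption of the sketch; the slide itself gives only the rounded per-year figures.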
17. How to promote
departmental/institutional openness?
• Encourage persistent identifiers e.g., ORCID
• Encourage preprints
• Encourage Open Access (OA)
• Recognize openness in hiring and P&T
• Teach open scholarship
• Promote institutional openness – repositories,
Wikimedian in residence
• Support institutional open data governance
18. NIH Strategic Plan for Data
• Support a Highly Efficient and
Effective Biomedical Research
Data Infrastructure
• Promote Modernization of the
Data-Resources Ecosystem
• Support the Development and
Dissemination of Advanced
Data Management, Analytics,
and Visualization Tools
• Enhance Workforce
Development for Biomedical
Data Science
• Enact Appropriate Policies to
Promote Stewardship and
Sustainability
https://grants.nih.gov/grants/rfi/NIH-Strategic-Plan-for-Data-Science.pdf
19. Research Data Infrastructure …
Both funders and some institutions
see the need to move from pipes to
platforms to accelerate research…
https://blog.lexicata.com/wp-content/uploads/2015/03/platform-model-750x410.png
20. If platforms are the answer we could
ask the question…
Will biomedical research become more
like Airbnb?
Vivien Bonazzi
Should biomedical research be like Airbnb?
doi: 10.1371/journal.pbio.2001818
21. I am not crazy, hear me out
• Airbnb is a platform that supports a trusted relationship
between consumer (renter) and supplier (host)
• The platform focuses on maximizing the exchange of services
between supplier and consumer and maximizing the amount
of trust associated with a given stakeholder
• It seems to be working:
– 60 million users searching 2 million listings in 192 countries
– Average of 500,000 stays per night.
– Valuation of US $25bn
Should biomedical research be like Airbnb?
doi: 10.1371/journal.pbio.2001818
22. Platforms will ultimately digitally
integrate the scholarly workflow for
human and machine analysis
Should biomedical research be like Airbnb?
doi: 10.1371/journal.pbio.2001818
23. Supplier, Consumer, Platform
Supplier - Consumer pairs:
• Paper Author - Paper Reader
• Data Provider - Data Consumer
• Employer - Employee
• Reagent Provider - Reagent Consumer
• Software Provider - Software Consumer
• Grant Writer - Grant Reviewer
• Educator - Student
Platforms:
• MS Project
• Google Drive
• Coursera
• ResearchGate
• Academia.edu
• Open Science Framework
• Synapse
• F1000
• Rio
Pilot Open Data Lab (ODL) underway
24. Why a comparison to Airbnb is not fair
• Airbnb was born digital
• The exchange of services on Airbnb is
simple compared to what is required of a
platform to support biomedical research
Nevertheless there is much to be
learnt
25. Impediments to a biomedical platform
• Current work practices by all stakeholders
• Entrenched business models
• Size of the undertaking aka resources
needed
• Trust
• Incentives to use the platform
http://www.forbes.com/sites/johnhall/2013/04/29/10-barriers-to-employee-innovation/#8bdbaa811133
26. Such platforms combined with
emerging analytics will likely have
significant impact on biomedical
engineering
27. Machine learning has been around for
over 20 years – why now?
• Amount of data available for training
• Open source - R and Python
• Advances in computing (e.g., GPUs) allow for deeper
neural nets (deep learning)
• Algorithmic efficiency gains (e.g., in
backpropagation)
• Success promotes further research
• Commercialization
Pastur-Romay et al. 2016 doi:10.3390/ijms17081313
28. Let me touch on our research in
protein engineering oh so briefly….
Structural Biology Meets Data Science – Does Anything
Change?
Crowd Source: Current Opinion in Structural Biology 2018
https://docs.google.com/document/d/1rD3Qh1btTYlnGkKefNGSFVq8v_mqRNa8I0o5MP3ZMW4/edit
29. Are there new scaffolds out there Nature
has yet to discover that AI could?
There are ~20^300 possible proteins
>>>> all the atoms in the Universe
96M protein sequences from
73,000 species (source RefSeq)
135,000 protein structures
yield 1221 folds (SCOPe 2.06)
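To make the size of that search space concrete, the sketch below compares orders of magnitude. It assumes 20 amino acids at each of 300 positions, which is how the slide's figure reads, and a commonly quoted estimate of roughly 10^80 atoms in the observable universe. That second figure is not from the slide and is labeled as an assumption in the code.

```python
import math

AMINO_ACIDS = 20               # standard amino acid alphabet
CHAIN_LENGTH = 300             # positions in the hypothetical protein
KNOWN_SEQUENCES = 96_000_000   # RefSeq count quoted on the slide
LOG10_ATOMS_IN_UNIVERSE = 80   # assumption: common order-of-magnitude estimate


def log10_sequence_space(alphabet: int, length: int) -> float:
    """Base-10 logarithm of alphabet**length, computed without overflow."""
    return length * math.log10(alphabet)


space_exponent = log10_sequence_space(AMINO_ACIDS, CHAIN_LENGTH)
known_exponent = math.log10(KNOWN_SEQUENCES)

# Python integers are arbitrary precision, so the exact count is also available.
digits_in_exact_count = len(str(AMINO_ACIDS ** CHAIN_LENGTH))

if __name__ == "__main__":
    print(f"possible sequences: about 10^{space_exponent:.1f}")
    print(f"exact count has {digits_in_exact_count} decimal digits")
    print(f"known sequences: about 10^{known_exponent:.1f}")
    excess = space_exponent - LOG10_ATOMS_IN_UNIVERSE
    print(f"orders of magnitude beyond the atom estimate: {excess:.0f}")
```

Working in logarithms keeps the comparison readable: the known sequences cover a vanishingly small part of the space, which is the motivation for asking whether AI could find scaffolds Nature has not.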
30. At DeepMind, which is based in London,
AlphaGo Zero is working out how proteins
fold, a massive scientific challenge that
could give drug discovery a sorely needed
shot in the arm.
39. Ethics, Law,
Policy & Social
Implications
• Data sharing
• Privacy
• Normativity
Wendy Novicoff, Ph.D
40. Conclusion:
Driven by large amounts of open
digital data of different types, and by new
algorithms and approaches, biomedical
researchers are destined to follow the
private sector towards the fourth
paradigm
41. Acknowledgements
The BD2K Team at NIH
My Colleagues at UVA
The 150 folks who have passed through my laboratory
https://docs.google.com/spreadsheets/d/1QZ48UaKcwDl_iFCvBmJsT03FK-bMchdfuIHe9Oxc-rw/edit#gid=0