1. AI from the Perspective of a School of Data Science
Philip E. Bourne PhD
peb6a@virginia.edu
https://www.slideshare.net/pebourne
October 27, 2022 ORNL AI Workshop
2. My Perspective aka Biases
• AI User
• Practical Science: Long-standing computational biomedical researcher
• Open Access: Co-Founder and Founding Editor-in-Chief, PLOS Computational Biology
• Open Knowledge: First President of FORCE11
• Data are Value: Involved in FAIR
• Translation: First Associate Vice Chancellor for Innovation and Industrial Alliances, UCSD
• Funders as Lever: First Associate Director for Data Science, NIH – preprints, data sharing, BD2K, etc.
• Change Higher Ed: Founding Dean, School of Data Science, UVA
3. AI Solved What Some Have Called the Holy Grail of Molecular Biology
https://medium.com/proteinqure/welcome-into-the-fold-bbd3f3b19fdd
1-D 3-D
7. AlphaFold2
Numerical optimization – differentiable programming
Overall gradient descent trained to win CASP
Jumper et al., 2021. Nature, 596(7873), pp. 583-589
Transformer models using attention
Geometry invariant to
translation/rotation
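To illustrate what "invariant to translation/rotation" means, here is a minimal sketch in plain Python. It is not AlphaFold2's mechanism, only a demonstration of the property: the pairwise distances between points do not change when the whole set is rotated and shifted. The coordinates, the rotation angle, the shift, and the helper names are all made up for illustration.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3-D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def translate(point, shift):
    """Shift a 3-D point by the vector `shift`."""
    return tuple(p + d for p, d in zip(point, shift))

def pairwise_distances(points):
    """Return the full matrix of Euclidean distances between points."""
    return [[math.dist(a, b) for b in points] for a in points]

# Toy coordinates (hypothetical, not a real protein structure).
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.0, 1.5, 1.0)]

# Apply a rigid motion: rotate every point, then translate it.
moved = [translate(rotate_z(p, 0.7), (3.0, -2.0, 5.0)) for p in coords]

before = pairwise_distances(coords)
after = pairwise_distances(moved)

# Largest change in any pairwise distance; only floating-point noise remains.
max_diff = max(
    abs(b - a) for row_b, row_a in zip(before, after) for b, a in zip(row_b, row_a)
)
distances_unchanged = max_diff < 1e-9
print(distances_unchanged)
```

The raw coordinates change under the motion, but a representation built from distances does not, which is why a model working with such quantities does not need to learn the same structure once per orientation.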
8. Reasons Behind the Win
● Nothing fundamentally new from an AI perspective
● Data integration
● Collaboration not competition
● Engineering challenge beyond most labs
● Compute power beyond most labs
● Team size beyond most labs
● Worked with protein structure specialists
9. While a victory for AI, there are implications that require a closer look …
https://www.dreamstime.com/
10. Reasons Behind the Win – Lessons for Data Science
● Nothing fundamentally new from an AI perspective
● Data integration – data science vs data engineering
● Collaboration not competition – team building is critical
● Engineering challenge beyond most labs – systems that scale up
● Compute power beyond most labs – systems that scale up
● Team size beyond most labs – human systems that scale up
● Worked with protein structure specialists – domains rule
11. Implications for Science
The fourth paradigm changes how we think of the long-standing need to perform an act of reductionism.
In the context of protein structure prediction, I refer to this as the Curse of the Ribbon.
Let me explain…
[Forthcoming: Bourne, Draizen, Mura, PLOS Biology 2022]
12. Protein Fold Space – Human Reductionism
There are ~20^300 possible proteins >>>> all the atoms in the Universe
~189M protein sequences from 292K organisms (source: UniProt)
Classified into ~1500 folds (source: SCOP)
https://doi.org/10.1073/pnas.2628030100
It has become apparent that fold space is more continuous
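The scale comparison on this slide can be checked with a few lines of arithmetic. The sketch below reads the slide's figure as 20^300, which assumes 20 amino acid types at each of 300 positions, and takes about 10^80 as a commonly quoted order-of-magnitude estimate for the number of atoms in the observable universe. Both the chain length and the atom count are assumptions for illustration, not values given in the talk.

```python
import math

AMINO_ACIDS = 20      # standard amino acid alphabet
CHAIN_LENGTH = 300    # assumed protein length behind the 20^300 figure
ATOMS_LOG10 = 80      # assumed rough estimate: ~10^80 atoms in the observable universe

# Python integers are arbitrary precision, so the count can be computed exactly.
sequences = AMINO_ACIDS ** CHAIN_LENGTH
digits = len(str(sequences))

# The same quantity on a log10 scale, which is easier to compare.
sequences_log10 = CHAIN_LENGTH * math.log10(AMINO_ACIDS)
excess_orders = sequences_log10 - ATOMS_LOG10

print(f"20^300 is about 10^{sequences_log10:.1f} ({digits} digits)")
print(f"that is roughly {excess_orders:.0f} orders of magnitude above ~10^{ATOMS_LOG10} atoms")
```

Under these assumptions the sequence count is around 10^390, while the ~189M sequences in UniProt cover only about 10^8 of them, which is the gap that makes any classification of fold space a heavy reduction.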
13. Curse of the Ribbon
[From Cam Mura]
The human desire to bin/classify/reduce and to simplify how we view data (ribbon diagram), while useful, masks (we argue) aspects of the data that algorithms can see
14. Downstream Implications for Data Science
• Cooperation rather than competition
• Public-private partnership
• Translational possibilities are endless
• Made possible by curated open data
• Appreciate engineering
15. As one example of the future success of AI, how does a school of data science think of itself?
It starts with a simple foundational model of data science that all in the school agree upon
16. The 4+1 Model of Data Science
• Value – assuring societal benefit
• Design – communication of the value of data
• Systems – the means to communicate and convey benefit
• Analytics – models and methods
• Practice – where everything happens
From: Alvarado & Bourne, in AI for Science, Eds. Choudhary, Fox & Hey, 2023
17. The Data Science Interplay
• Value + Design = openness, responsibility
• Value + Analytics = human-centered AI, algorithmic bias
• Value + Systems = sustainability, access, environmental impact
• Design + Analytics = literate programming, visualization
• Design + Systems = dashboards, engineering design
• Analytics + Systems = ML engineering
[From Raf Alvarado]
Thinking of data as a science unto itself is novel and controversial
19. Databases organize data around a project.
Data warehouses organize the data for an organization.
Data commons organize the data for a scientific discipline or field.
[Figure labels: Data Warehouse, Data Ecosystems]
Example – We Consider the Evolving Systems that Support AI
20. Challenges
Fixed level of funding
Opportunities
data commons
Data commons co-locate data with cloud computing infrastructure and commonly used software services, tools & apps for managing, analyzing and sharing data to create an interoperable resource for the research community.*
*Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson and Walt Wells, A Case for Data Commons Towards Data Science as a Service, IEEE Computing in Science and Engineering, 2016.
Source of image: The CDIS, GDC, & OCC data commons infrastructure at a University of Chicago data center.
Bonazzi VR, Bourne PE (2017) Should biomedical research be like Airbnb? PLoS Biol 15(4): e2001818.
Systems
[Adapted from Bob Grossman]
22. A Data Science Poster Child
Researcher and Assistant Professor of Medicine Dr. Thomas Hartka, also a current online Masters in Data Science student, is combining two disparate data sets (electronic health records and DMV crash data) to save lives after motor vehicle crashes.
“I enrolled in the MSDS program to expand my research on automotive safety. I have already used techniques from classes in my work. I hope to expand my research to real-time analytics to improve emergency room care.”
- Dr. Thomas Hartka, UVA School of Medicine
26. Research ethics committees (RECs) review the ethical acceptability of research involving human participants. Historically, the principal emphases of RECs have been to protect participants from physical harms and to provide assurance as to participants’ interests and welfare.*
[The Framework] is guided by Article 27 of the 1948 Universal Declaration of Human Rights. Article 27 guarantees the rights of every individual in the world "to share in scientific advancement and its benefits" (including to freely engage in responsible scientific inquiry)…*
Protect human subject data
The right of human subjects to benefit from research.
*GA4GH Framework for Responsible Sharing of Genomic and Health-Related Data, see goo.gl/CTavQR
Data sharing with protections provides the evidence so patients can benefit from advances in research.
Balance protecting human subject data with open research that benefits patients
[Adapted from Bob Grossman]
Value
27. Why Responsible Data Science?
• A defining feature
• A partnership between STEM, social sciences and the humanities
• Where UVA has strength
29. Daily Challenges
• Deciding what not to do
• Competition for the best team members (faculty and staff)
• Establishing a diverse team
• Lack of a comprehensive enterprise-wide data infrastructure
• It’s easier to conform
30. During my 5-year interview as dean I was asked,
“Will we need a school of data science in 10 years? Won’t it be ubiquitous throughout the university?”
My response:
“Will we need a university in ten years? Won’t it be one big school of data science?”
https://pebourne.wordpress.com/2022/06/29/deans-blog-data-science-ten-years-from-now/
31. Questions I Leave You With ….
• Have I overstated the case for data science?
• Are we currently doing the best by our students?
• Are the models we propose the right ones?
• Where do we go from here?
32. Punchline – in 45+ Years in Academia I Have Never Seen Anything Like It
• It is a response to the digital transformation of society
• It is touching every discipline (aka vertical)
• We can’t keep the students out of our classes
• Cause – large amounts of digital data
• Effect – interdisciplinarity, openness, translation, search for responsibility and more
In summary, it is disruptive and higher ed had better pay attention
Editor's Notes
I will introduce the concept of data science with a story that illustrates citizen engagement, the merging of unexpected data, and societal benefit