Science is a challenging domain for data science. Scientists collect large amounts of data through observations and experiments, but the main challenge of data analysis in science is not big data. First, scientific discovery is not solely about data or models learned from data but about finding relations between them. Moreover, discovered models should make accurate predictions and, more importantly, provide a deeper understanding consistent with existing scientific theories. Finally, science is about models that are interpretable and stated in established scientific formalisms. The field of computational scientific discovery aims at understanding the processes of scientific discovery and implementing tools that can assist scientists. The talk will provide an overview of recent advances in computational scientific discovery, focusing on methods for discovering mathematical equations from observational data.
6. The Nobel Prize in Chemistry 2013
Summary
In the 1970s, Michael Levitt, Martin Karplus, and Arieh Warshel successfully developed methods that combined quantum and classical mechanics to calculate the courses of chemical reactions using computers.
7. The Nobel Prize in Chemistry 2013
Jogalekar: Scientific American, October 2013
First and foremost it is a prize for the field rather than for individuals, a signal from the Nobel Committee that computational methods have come of age.
You would be hard-pressed these days to find papers that don't include at least some computational component, from the simple visualization of a molecule to very rigorous high-level quantum mechanical calculations.
8. Talk Outline
• Challenges for data science in science
• Computational Scientific Discovery
• Results
• Methods
• Conclusion
9. Challenges of Data Science in Science
1. Not only data and not only models, but the relation between data and models
2. Not only accurate predictions, but also an understanding of the models that is consistent with existing scientific theories
3. Not just any kind of models, but understandable models stated in established scientific formalisms
10. Computational Scientific Discovery
Definition
Research in Artificial Intelligence that aims to develop computer systems which produce results that, if a human scientist did the same, we would refer to as discoveries.
Džeroski and Todorovski, Eds. (2007) Computational Discovery of Scientific Knowledge. Springer.
13. Why Equations?
• Most common form of knowledge in science
• Capture the relationships between variables
• Understandable to human scientists
• Potential to explain the observed phenomena
16. Explanatory Power of Equations (1)
As phytoplankton takes up nitrogen, its concentration increases and the nitrogen decreases.
Bridewell et al. (2008) Machine Learning 71: 1–32 doi:10.1007/s10994-007-5042-6
18. Explanatory Power of Equations (1)
As phytoplankton takes up nitrogen, its concentration increases and the nitrogen decreases.
Note the relation of the model to the data.
Bridewell et al. (2008) Machine Learning 71: 1–32 doi:10.1007/s10994-007-5042-6
19. Explanatory Power of Equations (2)
The uptake continues until the nitrogen is exhausted, which leads to a phytoplankton die-off.
Bridewell et al. (2008) Machine Learning 71: 1–32 doi:10.1007/s10994-007-5042-6
20. Explanatory Power of Equations (3)
This produces detritus, which gradually remineralizes to replenish nitrogen...
Bridewell et al. (2008) Machine Learning 71: 1–32 doi:10.1007/s10994-007-5042-6
21. Explanation
Equations
As phytoplankton takes up nitrogen, its concentration increases and the nitrogen decreases. The uptake continues until the nitrogen is exhausted, which leads to a phytoplankton die-off. This produces detritus, which gradually remineralizes to replenish nitrogen. Zooplankton grazes on phytoplankton, slowing the latter's increase and also producing detritus.
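The narrative above can be read term by term as a system of ordinary differential equations. The following is a minimal sketch, not the model from Bridewell et al.: the functional forms (saturating uptake, linear mortality, bilinear grazing), every rate constant, and the names `simulate`, `uptake`, `graze`, `remin`, and `assim` are assumptions chosen only for illustration. It integrates a closed nitrogen, phytoplankton, zooplankton, detritus system with a simple Euler step.

```python
def simulate(steps=2000, dt=0.01):
    """Euler-integrate a toy closed nitrogen/phytoplankton/zooplankton/detritus system."""
    # State in arbitrary units: nitrogen, phytoplankton, zooplankton, detritus.
    n, p, z, d = 1.0, 0.1, 0.05, 0.0
    # Hypothetical rate constants, chosen only for illustration.
    uptake, p_death, graze, z_death, remin = 1.0, 0.1, 0.5, 0.1, 0.05
    assim = 0.3  # assumed fraction of grazed phytoplankton that becomes zooplankton
    history = [(n, p, z, d)]
    for _ in range(steps):
        u = uptake * p * n / (0.2 + n)  # phytoplankton takes up nitrogen (saturating)
        g = graze * z * p               # zooplankton grazes on phytoplankton
        dn = -u + remin * d             # nitrogen is consumed and replenished by detritus
        dp = u - p_death * p - g        # growth minus die-off minus grazing losses
        dz = assim * g - z_death * z
        dd = p_death * p + (1 - assim) * g + z_death * z - remin * d
        n, p, z, d = n + dt * dn, p + dt * dp, z + dt * dz, d + dt * dd
        history.append((n, p, z, d))
    return history


if __name__ == "__main__":
    first, last = simulate()[0], simulate()[-1]
    print("start:", first)
    print("end:  ", last)
```

Every loss term in this sketch reappears as a gain in another pool, so the total across the four pools stays constant, which is one simple consistency check for a model of this kind.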
22. More Results: Protist Dynamics
Simulated and observed trajectories for two predator–prey data sets
Bridewell et al. (2008) Machine Learning 71: 1–32 doi:10.1007/s10994-007-5042-6
26. Formal Grammars as Knowledge Vehicles
Traditional use of formal grammars: Specification of languages and parsers
We use formal grammars
• To specify the space of plausible expressions in the domain of use, i.e., expressions that are aligned with domain theory
• As generators of candidate expressions
30. Probabilistic Grammars
Grazing → const · PhytoPlankton · Nutrient [p]
Grazing → const · PhytoPlankton · Nutrient / (const + Nutrient) [1 − p]
We use formal probabilistic grammars
• To specify the space of plausible expressions in the domain of use, i.e., expressions that are aligned with domain theory
• As generators of candidate expressions
• To define an a priori probability distribution over candidate expressions
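The two Grazing productions above can be sketched as a tiny sampler. This is an illustrative sketch only: the function names `grazing_rules`, `sample_grazing`, and `prior` are hypothetical, and the expressions are kept as plain strings rather than parsed trees. It shows the two roles the slide names: generating candidate expressions and assigning each an a priori probability.

```python
import random


def grazing_rules(p):
    """Return the two Grazing productions with probabilities p and 1 - p."""
    return [
        (p, "const * PhytoPlankton * Nutrient"),
        (1.0 - p, "const * PhytoPlankton * Nutrient / (const + Nutrient)"),
    ]


def sample_grazing(p, rng):
    """Draw one candidate expression according to the rule probabilities."""
    rules = grazing_rules(p)
    weights = [weight for weight, _ in rules]
    bodies = [body for _, body in rules]
    return rng.choices(bodies, weights=weights, k=1)[0]


def prior(expression, p):
    """A priori probability of an expression under this two-rule grammar."""
    return sum(weight for weight, body in grazing_rules(p) if body == expression)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_grazing(0.7, rng))
```

With a larger grammar the same idea applies recursively: the probability of an expression is the product of the probabilities of the productions used to derive it.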
31. Deterministic vs Probabilistic Grammars
Brence et al. (2021) Knowledge-Based Systems 224: 107077 doi:10.1016/j.knosys.2021.107077
32. Dimensionally Consistent Grammars
Take care of correctly combining measurement units
E[m] → E[m] + E[m]
E[s] → E[s] + E[s]
E[m/s] → E[m] / E[s]
V[m] → distance
V[s] → time
V[m/s] → velocity
Brence et al. (2023) Information Sciences 632: 742–756 doi:10.1016/j.ins.2023.03.073
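The unit bookkeeping encoded in these productions can be illustrated with a small checker. Note the difference in approach: the grammar enforces consistency by construction, while this sketch verifies the same constraint after the fact on an already built expression tree. The representation of units as exponent pairs and the names `VARIABLE_UNITS` and `unit_of` are assumptions made for illustration.

```python
# Units as exponent pairs (metres, seconds): m -> (1, 0), s -> (0, 1), m/s -> (1, -1).
VARIABLE_UNITS = {"distance": (1, 0), "time": (0, 1), "velocity": (1, -1)}


def unit_of(expr):
    """Return the unit of an expression tree, or raise ValueError if inconsistent.

    An expression is a variable name or a tuple (op, left, right) with op "+" or "/".
    """
    if isinstance(expr, str):
        return VARIABLE_UNITS[expr]
    op, left, right = expr
    left_unit, right_unit = unit_of(left), unit_of(right)
    if op == "+":
        # Addition is only allowed between quantities with identical units.
        if left_unit != right_unit:
            raise ValueError(f"cannot add units {left_unit} and {right_unit}")
        return left_unit
    if op == "/":
        # Division subtracts the unit exponents.
        return (left_unit[0] - right_unit[0], left_unit[1] - right_unit[1])
    raise ValueError(f"unknown operator {op!r}")


if __name__ == "__main__":
    print(unit_of(("/", "distance", "time")))  # same unit as velocity
    try:
        unit_of(("+", "distance", "time"))
    except ValueError as exc:
        print("rejected:", exc)
```

An expression such as distance + time is rejected here, which mirrors the fact that no production of the grammar above can derive it.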
34. From Grammars to Deep Generative Models
Logical step forward: replace grammars with more general generative models.
Q: Can we efficiently train a deep generative model for expressions?
A: Yes, but only if we develop an appropriate generative model.
35. HVAE: Efficient Generator of Expressions
Model architecture tailored to the hierarchical structure of the expressions
Recursive arrangement of 2-to-1 and 1-to-2 GRU cells
Mežnar et al. (2023) arXiv:2302.09893 doi:10.48550/arXiv.2302.09893
36. Comparative Evaluation of HVAE
Mežnar et al. (2023) arXiv:2302.09893 doi:10.48550/arXiv.2302.09893
37. EDHiE: Symbolic Regression with HVAE
Combination of
• Deep generative model HVAE
• Evolutionary algorithm
Mežnar et al. (2023) arXiv:2302.09893 doi:10.48550/arXiv.2302.09893
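The general idea of pairing a generative model with an evolutionary search can be sketched in a few lines. This is a toy illustration and not the EDHiE implementation: the "decoder" is a hypothetical stand-in that maps a two-dimensional latent point to the nearest of four hand-picked expressions, and the search is a mutation-only evolutionary loop with elitism. The names `PROTOTYPES`, `decode`, `error`, and `evolve` are all assumptions.

```python
import random

# Hypothetical stand-in for a trained decoder: latent prototype, expression, callable.
PROTOTYPES = [
    ((0.0, 0.0), "x", lambda x: x),
    ((1.0, 0.0), "x*x", lambda x: x * x),
    ((0.0, 1.0), "x+1", lambda x: x + 1),
    ((1.0, 1.0), "x*x+x", lambda x: x * x + x),
]


def decode(z):
    """Map a latent point to the prototype nearest to it (toy decoder)."""
    return min(
        PROTOTYPES,
        key=lambda proto: (proto[0][0] - z[0]) ** 2 + (proto[0][1] - z[1]) ** 2,
    )


def error(z, data):
    """Mean squared error of the decoded expression on (x, y) pairs."""
    _, _, func = decode(z)
    return sum((func(x) - y) ** 2 for x, y in data) / len(data)


def evolve(data, rng, pop_size=20, generations=30, sigma=0.3):
    """Evolve latent vectors; the better half survives, so the best is never lost."""
    population = [(rng.uniform(-1, 2), rng.uniform(-1, 2)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda z: error(z, data))
        parents = population[: pop_size // 2]
        # Each parent produces one child by Gaussian mutation in latent space.
        children = [
            (z[0] + rng.gauss(0, sigma), z[1] + rng.gauss(0, sigma)) for z in parents
        ]
        population = parents + children
    best = min(population, key=lambda z: error(z, data))
    return decode(best)[1], error(best, data)


if __name__ == "__main__":
    observations = [(x, x * x + x) for x in range(-3, 4)]
    print(evolve(observations, random.Random(0)))
```

The point of searching in latent space is that mutation becomes a small numeric perturbation of a vector, while the decoder is responsible for turning every vector back into a syntactically valid expression.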
40. Data Science for Science
Blei and Smith (2017) PNAS 114(33): 8689–8692 doi:10.1073/pnas.1702076114
For each scientific problem, the data scientist develops an understanding of its context: how the data were collected, existing theories and domain knowledge, and the overarching goals of the discipline.
Crucially, the data scientist solves the problem iteratively and collaboratively with the domain expert. Together, they develop computational and statistical tools to explore data, questions, and methods in the service of the goals of the discipline.
41. Data Science for Science
Blei and Smith (2017) PNAS 114(33): 8689–8692 doi:10.1073/pnas.1702076114
Data science is more than the combination of statistics and computer science. It requires training in how to weave statistical and computational techniques into a larger framework, problem by problem, and to address discipline-specific questions.
42. Do Not Try to Replace Scientists
Perrakis and Sixma (2021) EMBO Reports 22: e54046 doi:10.15252/embr.202154046
Notwithstanding all the justified excitement about AlphaFold, this achievement does not mean though that AI will make experimental structural biology or its practitioners and tools redundant. Structural biology will remain essential for understanding how proteins work and how they dynamically interact with each other.
43. Data Science for Science
Computational Science
Application Domain of Science
44. Take-Home Message
Three Basic Principles of Data Science for Science
Establishing the relation between data and models
Building explanatory models rooted in scientific theories
Casting models in standard scientific formalisms
Tailoring the algorithms to the problem and not vice versa
45. Collaborators
Computational Scientists
Jure Brence, Jožef Stefan Institute
Sebastian Mežnar, Jožef Stefan Institute
Sašo Džeroski, Jožef Stefan Institute
Will Bridewell, Stanford University
Pat Langley, Stanford University
Domain Scientists
Matej Radinja, University of Ljubljana
Mateja Škerjanec, University of Ljubljana
Nataša Atanasova, University of Ljubljana
Kevin Arrigo, Stanford University
46. Thanks for Your Attention, Discussion Time
Institutions Involved and Financial Support