Open Force Field: Scavenging pre-emptible CPU hours* in the age of COVID - Jeff Wagner
1. Open Force Field: Scavenging pre-emptible CPU hours* in the age of COVID
Jeff Wagner, Open Force Field Technical Lead ("Scavenger King")
*and panel spots
2. OPEN Software, OPEN Data, OPEN Science are rapidly facilitating force field science!
OPEN SOFTWARE: Automated infrastructure enables rapid experimentation with minimal human intervention.
OPEN DATA: Access to large, high-quality experimental and quantum chemical data facilitates easy curation of balanced train/test sets.
OPEN SCIENCE: Exploring new force field science (hypothesis → build software → train → test → iterate) is now almost routine.
4. Training new force fields requires new data
Large molecule datasets (SMILES strings) → quantum chemical calculation results (× thousands of molecules)
7. PRP is capable of running enormous quantum chemistry workloads
[Timeline: OpenFF begins using Nautilus; OpenFF-1.0.0 released; OpenFF-2.0.0 released]
8. OpenFF force fields are state-of-the-art
OpenFF 2.0.0 outperforms other public small molecule force fields, and OpenFF force fields continue to improve.
OpenFF 2.0.0 outperforms OpenFF 1.2.1, GAFF 2.1, and CGenFF in free energy calculations. The proprietary OPLS3e force field shows the best performance.
Consider a water molecule moving around in space. Think of it like a ball-and-stick model. It's going to have a list of different inter- and intramolecular motions and interactions. Let's just consider the O-H bonds. In liquid water, these bonds stretch a bit back and forth, more or less symmetrically around an equilibrium distance we'll call r_eq. This is modelled with a harmonic oscillator, the same one from physics class. Here there are two fitted parameters: one value for the equilibrium bond length and another for the force constant of the spring. In this case we can reliably fit these values to a combination of experimental data and highly accurate quantum chemical calculations, so we can model this particular detail of the motion of water with pretty good accuracy.
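In standard notation (using the r_eq defined above and a fitted force constant k), that bond-stretch term is just the textbook harmonic potential:

E_bond(r) = (k/2) * (r - r_eq)^2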
There are approximations that can be made for the other forces that atoms feel, which I won't go into here. But when you add them all up, you get a physical model of a molecule that you can propagate forward through time, on the scale of femtoseconds per timestep. If these models are accurate enough, you can use them to guide research into chemicals with desired properties, for example finding drug candidates that bind to a protein or polymers with particular material characteristics.
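To make "propagate forward through time" concrete, here is a minimal sketch of a velocity-Verlet integrator for that single harmonic bond (all numerical values are illustrative placeholders, not OpenFF's fitted parameters):

```python
# Minimal sketch: velocity-Verlet integration of one harmonic O-H bond.
# All values below are illustrative placeholders, not OpenFF parameters.

k = 500.0     # force constant (illustrative units)
r_eq = 0.96   # equilibrium bond length (roughly O-H, in angstroms)
m = 1.0       # reduced mass (illustrative)
dt = 1e-3     # timestep (femtosecond-scale in spirit)

def force(r):
    # F = -dE/dr for E(r) = (k/2) * (r - r_eq)**2
    return -k * (r - r_eq)

r, v = 1.05, 0.0  # start slightly stretched, at rest
for step in range(1000):
    a = force(r) / m
    r += v * dt + 0.5 * a * dt**2   # position update
    a_new = force(r) / m
    v += 0.5 * (a + a_new) * dt     # velocity update (averaged accelerations)
```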
The basic task here is to take a molecule and run QM calculations on it to determine its optimized geometry, which serves as a sort of baseline reference from first-principles physics that can be used in fitting and, later, benchmarking. It's not such a fundamentally difficult thing to do with one of the many off-the-shelf tools, but we need to do it at massive scale. Depending on the direction that fitting experiments go, we might have data sets of thousands to tens of thousands of molecules used in individual fits, so generating these data sets must be automated and standardized. There are a lot of different methods and settings that scientists might want to use in the QM space, so there must be a way to coherently communicate with different programs. Each of these calculations takes on the order of a few hours to days per molecule, so we're talking about a large amount of compute to generate this data. And finally, the actual results must be stored in a way that's rapidly accessible for future fitting experiments - ideally in a publicly accessible database, so that the community can make use of our compute without needing to ask for an account on one of our clusters or ship hard drives around.
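For a flavor of what "coherently communicate with different programs" looks like, here is a rough sketch of dispatching a single calculation through QCEngine's unified interface (the geometry and level of theory are illustrative, not our production settings):

```python
import qcelemental as qcel
import qcengine as qcng

# Illustrative water geometry; production inputs are generated and
# standardized automatically rather than written by hand.
mol = qcel.models.Molecule.from_data("""
O  0.000  0.000  0.000
H  0.000  0.757  0.587
H  0.000 -0.757  0.587
""")

inp = qcel.models.AtomicInput(
    molecule=mol,
    driver="gradient",  # could also be "energy", "hessian", ...
    model={"method": "b3lyp-d3bj", "basis": "dzvp"},  # illustrative level of theory
)

# The same input can be routed to different QM engines by name.
result = qcng.compute(inp, "psi4")
print(result.return_result)
```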
What we do: we create new, comprehensive force fields and systematically improve their accuracy through scientific innovation and large, high-quality datasets. This schematic gives an overview of that process. We begin by generating, curating, and sharing the datasets necessary for producing and benchmarking high-accuracy force fields. We create and maintain open-source software for systematic, automated parameter optimization against our curated datasets. Finally, we benchmark the force field to evaluate whether it has significantly improved. If force fields do not meet our standards, they go back into the pipeline. All infrastructure, datasets, and force fields created during this process are released openly with permissive licenses so users can rapidly use, modify, and extend our work.
PRP is used heavily in this force field creation workflow during the QC data generation stage.
This is all a pretty tall task, but fortunately it's mostly a solved problem. We use QCArchive, a project out of the Molecular Sciences Software Institute (MolSSI) at Virginia Tech. It's basically a public archive of QM calculations, but it also includes a lot of infrastructure for generating new data, including talking to different QM engines via a unified interface (QCEngine). Two of the key contributors to the project (Daniel Smith and Lori Burns) gave a SciPy talk about it a few years ago. The scale of the project has grown and the backend has been partially rewritten since then, but the talk holds up today. QCArchive handles most of the hard stuff - storing results in a database, running QM calculations with Psi4 - and we built a tool that makes our communication with it a little easier, since there's some pre-processing we need to do before sending things off to QM. This tool is called QCSubmit, and it pretty elegantly handles the tasks of "I have a bunch of molecules, please run QM calculations on these" and, later, "please go fetch for me the QM calculations on these molecules".
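As a rough sketch of that first task, a QCSubmit submission looks something like the following (this follows openff-qcsubmit's factory pattern, but exact class and argument names vary between versions, and the server address is hypothetical):

```python
from openff.toolkit.topology import Molecule
from openff.qcsubmit.factories import OptimizationDatasetFactory

# "I have a bunch of molecules..." -- build them from SMILES strings.
mols = [Molecule.from_smiles(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]

# The factory handles the pre-processing: conformer generation,
# deduplication, attaching a standardized QC specification, etc.
factory = OptimizationDatasetFactory()
dataset = factory.create_dataset(
    dataset_name="Example optimization dataset",
    molecules=mols,
    tagline="Toy example",
    description="Illustrative submission sketch, not a real OpenFF dataset.",
)

# "...please run QM calculations on these": hand the dataset to a
# QCFractal server (address below is hypothetical).
# import qcportal
# client = qcportal.FractalClient("https://qcarchive.example.org")
# dataset.submit(client)
```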
Some of our calculations are run by grad students and postdocs running “QC managers” as cluster jobs at their respective universities. But something like half of our total compute is run on Nautilus. It’s a uniquely suitable compute backend to pair with the NSF-funded MolSSI QCArchive project, which was designed to take advantage of preemptible compute. Our datasets consist of hundreds to millions of jobs, each requiring tens to thousands of CPU-hours and 8-32 GB of RAM.
And so, in combination with the efforts of modeling scientists and engineers, the enormous amount of training data generated by PRP has helped us release continuous improvements to our force fields, such that our models are now comparable to other academic models with decades of development, and we’re closing in on the accuracy of models developed by for-profit chemical modeling software vendors.