Machine Learning Potentials¶
Machine Learning Interatomic Potentials (MLIPs) are an increasingly popular method of simulating molecular systems. In both speed and accuracy, they are intermediate between classical force fields on one hand and high-level quantum chemistry methods on the other. The OpenMM-ML package provides a simple, easy-to-use interface for using them in OpenMM.
MLIPs cover a large class of models. They begin with some type of machine learning model, typically a neural network, often supplemented with standard physics-based interactions. The parameters of the model are trained to reproduce a large library of forces and energies, which are typically generated with a highly accurate quantum chemistry method. The resulting model can approach the accuracy of the original method while being many times faster. Some MLIPs are specific to just one molecule, while others are trained to cover large areas of chemical space and be applicable to arbitrary systems with no further training. The latter are often known as “foundation models”. OpenMM-ML supports both types of models. This tutorial focuses on foundation models, which are much easier to use and more widely useful. Specialized models usually require you to train your own model for whatever system you want to simulate, a much larger task than just selecting a pre-trained, ready-to-use model. The tradeoff is that a specialized model for one specific molecule can be much smaller and faster than a general-purpose model of similar accuracy.
While the details vary, a good rule of thumb is that MLIPs are about 1000 times slower than classical force fields, but about 1000 times faster than high level quantum chemistry methods. That might change in the future, but they are unlikely to replace force fields any time soon. They are mostly useful for relatively small systems, or small pieces of larger systems. When used in that way, they enable simulations that would be impossibly slow with conventional quantum mechanical methods.
Installing Models¶
OpenMM-ML provides a common interface for working with many different models, but it does not provide the models themselves. They must be installed separately using whatever method the model developer has chosen. Most models can be installed with pip, which is therefore the preferred installation method.
The following command installs all packages needed for this tutorial, including two packages that provide models, orb-models and TorchMD-Net. We also install MDAnalysis, which is not needed to run simulations but is used in this tutorial for analyzing results.
pip install openmm[cuda13] openmmml orb-models torchmd-net mdanalysis
Running ML Simulations¶
With that out of the way, let’s try simulating something. First we import libraries that will be used during the simulation.
[1]:
from openmm import *
from openmm.app import *
from openmm.unit import *
from openmmml import MLPotential
import MDAnalysis as mda
from MDAnalysis.analysis.dihedrals import Ramachandran
import sys
import logging
logging.basicConfig(level=logging.ERROR)
As an example, let’s simulate alanine dipeptide in vacuum. We can load the molecule from a PDB file.
[2]:
pdb = PDBFile('alanine-dipeptide.pdb')
print(pdb.topology)
<Topology; 1 chains, 3 residues, 22 atoms, 21 bonds>
Now create a System for it. We use the AceFF 2.0 model. This is a relatively fast but less accurate potential (as MLIPs go).
[3]:
potential = MLPotential('aceff-2.0')
system = potential.createSystem(pdb.topology)
That’s all there is to it! MLPotential plays the same role in ML simulations that ForceField does in conventional simulations. You create an instance of it, specifying the name of the potential to use, then call createSystem() to create a System to simulate.
To prove we really have a working System, let’s run a short simulation.
[4]:
integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 0.001*picoseconds)
simulation = Simulation(pdb.topology, system, integrator)
simulation.reporters.append(StateDataReporter(sys.stdout, 100, potentialEnergy=True, temperature=True, step=True, speed=True))
simulation.context.setPositions(pdb.positions)
simulation.step(1000)
#"Step","Potential Energy (kJ/mole)","Temperature (K)","Speed (ns/day)"
100,-844.458740234375,53.319062375260714,0
200,-839.6515502929688,82.89689504325784,42.2
300,-839.5911254882812,83.8807790637833,43
400,-823.9935302734375,94.66103127523544,42.3
500,-818.1985473632812,129.62461685886882,41.8
600,-814.7901000976562,119.25928243622221,42
700,-817.9552001953125,179.3754547066664,42.2
800,-822.4508666992188,228.6552988894026,42.3
900,-809.474365234375,193.80415636153424,42.5
1000,-817.1160278320312,222.27929697715678,42.3
Only about 42 ns/day for a molecule with 22 atoms. (That simulation was run on an NVIDIA RTX 4080 GPU.) I warned you that MLIPs were slow! On the other hand, we simulated 1000 steps in only a few seconds. Most quantum chemistry methods with similar accuracy would take that long to perform a single energy evaluation.
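As a sanity check on those numbers, we can convert the reported throughput into wall-clock time per step. A quick back-of-envelope sketch (the 42 ns/day figure is taken from the output above; the timestep is the 1 fs we passed to the integrator):

```python
# Convert simulation throughput (ns/day) into steps per second
# and wall-clock time for a 1000-step run.
ns_per_day = 42.0        # reported by the StateDataReporter above
timestep_fs = 1.0        # 0.001 ps = 1 fs, as passed to LangevinIntegrator

fs_per_day = ns_per_day * 1e6                        # 1 ns = 1e6 fs
steps_per_second = fs_per_day / timestep_fs / 86400  # 86400 s per day
seconds_for_1000_steps = 1000 / steps_per_second

print(f"{steps_per_second:.0f} steps/s")                 # ~486 steps/s
print(f"{seconds_for_1000_steps:.1f} s for 1000 steps")  # ~2.1 s
```

That works out to roughly two seconds for the 1000-step run, which matches the claim above.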
Let’s try something a bit more interesting: run a longer simulation and compute a Ramachandran plot.
[5]:
def simulate(system):
    # Prepare a Simulation
    integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 0.001*picoseconds)
    simulation = Simulation(pdb.topology, system, integrator)
    simulation.context.setPositions(pdb.positions)
    simulation.context.setVelocitiesToTemperature(300*kelvin)
    # Equilibrate
    simulation.step(1000)
    # Generate data
    simulation.reporters.append(XTCReporter('alanine-dipeptide.xtc', 100))
    simulation.step(100000)
Let’s see how long it takes.
[6]:
%%time
simulate(system)
CPU times: user 3min 6s, sys: 8.75 s, total: 3min 15s
Wall time: 3min 16s
We can use MDAnalysis to do the analysis and generate the plot.
[7]:
u = mda.Universe('alanine-dipeptide.pdb', 'alanine-dipeptide.xtc')
ramachandran = Ramachandran(u.select_atoms("protein")).run()
ramachandran.plot()
[7]:
<Axes: xlabel='$\\phi$', ylabel='$\\psi$'>
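Under the hood, a Ramachandran analysis simply measures the φ (C–N–CA–C) and ψ (N–CA–C–N) backbone dihedrals in every frame. For reference, here is a minimal, self-contained sketch of the standard dihedral-angle formula; the coordinates are made-up illustrative points, not taken from the trajectory:

```python
import math

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees for four 3D points, using the
    standard atan2 formulation."""
    sub = lambda a, b: tuple(x - y for x, y in zip(a, b))
    cross = lambda a, b: (a[1]*b[2] - a[2]*b[1],
                          a[2]*b[0] - a[0]*b[2],
                          a[0]*b[1] - a[1]*b[0])
    dot = lambda a, b: sum(x*y for x, y in zip(a, b))
    norm = lambda a: math.sqrt(dot(a, a))

    # Bond vectors along the four atoms, then the two plane normals
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)
    m1 = cross(n1, tuple(x / norm(b1) for x in b1))
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))

# A planar trans arrangement gives +/-180 degrees, cis gives 0:
print(dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0)))  # 180.0
print(dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0)))   # 0.0
```

MDAnalysis does exactly this for every residue and every frame, then scatter-plots the (φ, ψ) pairs.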
Simulating Mixed ML/MM Systems¶
The behavior of alanine dipeptide (and most other organic molecules) is very different in vacuum from in water. We would prefer to simulate it in a box of water, but that requires many times more atoms. Simulating it entirely with an MLIP would be very, very slow.
Instead we can use a mixed system. We will use AceFF 2.0 only to compute the internal energy of the alanine dipeptide molecule. The internal energy of the water, and the interaction energy between the water and solute, will be computed with a classical force field.
First we load a PDB file with the solvated alanine dipeptide.
[8]:
pdb = PDBFile('alanine-dipeptide-water.pdb')
print(pdb.topology)
<Topology; 750 chains, 752 residues, 2269 atoms, 1519 bonds>
To create a mixed system, we first create an ordinary System that models everything with a conventional force field.
[9]:
ff = ForceField('amber19-all.xml', 'amber19/tip3pfb.xml')
mm_system = ff.createSystem(pdb.topology, nonbondedMethod=PME)
Now we use our MLPotential to create a new System that replaces a subset of interactions with the MLIP. We simply give it the conventional System, and a list of atom indices to model with ML.
[10]:
chains = list(pdb.topology.chains())
ml_atoms = [atom.index for atom in chains[0].atoms()]
ml_system = potential.createMixedSystem(pdb.topology, mm_system, ml_atoms)
Let’s simulate the mixed System and see how the speed compares to the other simulation.
[11]:
%%time
simulate(ml_system)
CPU times: user 10min 47s, sys: 12.1 s, total: 10min 59s
Wall time: 10min 58s
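The two %%time measurements can be turned into comparable throughput numbers. A rough conversion (wall times copied from the cells above; each run covers 101,000 steps of 1 fs, i.e. 0.101 ns):

```python
# Rough throughput comparison between the vacuum and mixed ML/MM runs.
simulated_ns = 101000 * 0.001 * 0.001   # 101,000 steps x 1 fs = 0.101 ns
vacuum_s = 3 * 60 + 16                  # 3min 16s wall time
mixed_s = 10 * 60 + 58                  # 10min 58s wall time

for label, seconds in [("vacuum", vacuum_s), ("mixed ML/MM", mixed_s)]:
    print(f"{label}: {simulated_ns / seconds * 86400:.1f} ns/day")
# vacuum: ~44.5 ns/day, mixed ML/MM: ~13.3 ns/day
```

Even though the solvated system has roughly 100 times as many atoms (2269 vs. 22), the mixed run is only about 3 times slower, because the water is handled by the cheap classical force field.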
The Ramachandran plot looks quite different from before, showing the importance of water.
[12]:
u = mda.Universe('alanine-dipeptide-water.pdb', 'alanine-dipeptide.xtc')
ramachandran = Ramachandran(u.select_atoms("protein")).run()
ramachandran.plot()
[12]:
<Axes: xlabel='$\\phi$', ylabel='$\\psi$'>
Other Models¶
For these simulations we used a model that is fast but relatively inaccurate, both in comparison to other MLIPs. In some cases, you may prefer a different tradeoff between speed and accuracy. Let’s try simulating alanine dipeptide in vacuum again, but this time using orb-v3-conservative-omol. This is a much more accurate but correspondingly slower potential.
The code is exactly the same as before. The only change is the name of the potential passed to MLPotential.
[13]:
pdb = PDBFile('alanine-dipeptide.pdb')
potential = MLPotential('orb-v3-conservative-omol')
system = potential.createSystem(pdb.topology)
integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 0.001*picoseconds)
simulation = Simulation(pdb.topology, system, integrator)
simulation.reporters.append(StateDataReporter(sys.stdout, 100, potentialEnergy=True, temperature=True, step=True, speed=True))
simulation.context.setPositions(pdb.positions)
simulation.step(1000)
#"Step","Potential Energy (kJ/mole)","Temperature (K)","Speed (ns/day)"
100,-1301870.472026293,61.02413538058426,0
200,-1301864.064797156,107.8435303300016,6.13
300,-1301860.3900627983,144.32587640266559,6.15
400,-1301860.2958388403,183.82281060477646,6.13
500,-1301847.9525003564,128.49887543657866,6.12
600,-1301845.8795732828,156.1019706331902,6.12
700,-1301831.2748598093,157.93735591078925,6.12
800,-1301834.290026462,208.32715287948798,6.13
900,-1301831.2748598093,215.42493142754162,6.14
1000,-1301830.8979639777,185.22360055881106,6.14
There is no universal best choice for what model to use. You need to pick one that fits the requirements of your specific application.
Going Further¶
OpenMM-ML supports many other MLIPs. That includes pretrained foundation models, as well as models you train yourself with popular frameworks such as MACE and NequIP.
Particular models may have additional options you can specify to control their behavior. For example, both AceFF and Orb let you specify the total charge of the system, which allows you to simulate charged molecules:
system = potential.createSystem(pdb.topology, charge=1)
See the documentation for details on how to use these features.