Making Room for AI: Multi-GPU Molecular Dynamics with Deep Potentials in GROMACS
Luca Pennati, Andong Hu, Ivy Peng, Lukas Müllender, Stefano Markidis
TLDR
This paper integrates DeePMD-kit into GROMACS, enabling scalable, multi-GPU molecular dynamics with AI-driven potentials for near-quantum accuracy.
Key contributions
- Integrates DeePMD-kit into GROMACS, extending the NNPot interface for multi-GPU, AI-driven MD simulations.
- Introduces a decoupled domain decomposition layer for concurrent inference across multi-node systems.
- Achieves strong-scaling efficiency of 66% at 16 GPUs and 40% at 32 GPUs on NVIDIA A100 and AMD MI250x hardware.
- Identifies DeePMD inference (>90% of wall time) and the ghost-atom cost set by the cutoff radius as the primary bottlenecks.
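The abstract describes each MD step as using two MPI collectives: one to make coordinates available to all ranks, and one to aggregate and redistribute forces after concurrent inference. A minimal sketch of that per-step data flow is below, with NumPy operations standing in for the MPI collectives; the function and variable names are illustrative, not taken from the GROMACS or DeePMD-kit code.

```python
import numpy as np

def md_step(local_coords_per_rank, model_force_fn):
    """Illustrative per-step data flow for domain-decomposed inference.

    local_coords_per_rank: list of (n_i, 3) arrays, one per MPI rank.
    model_force_fn: maps an (N, 3) coordinate array to an (N, 3) force array.
    """
    # Collective 1 (stand-in for broadcasting coordinates): every rank
    # obtains the full coordinate set before inference.
    all_coords = np.concatenate(local_coords_per_rank, axis=0)

    # Concurrent inference: each "rank" evaluates the model but keeps
    # only the force rows for its own atoms, zero elsewhere.
    partial_forces, start = [], 0
    for local in local_coords_per_rank:
        f = np.zeros_like(all_coords)
        stop = start + len(local)
        f[start:stop] = model_force_fn(all_coords)[start:stop]
        partial_forces.append(f)
        start = stop

    # Collective 2 (stand-in for an MPI reduction): sum the partial
    # force arrays and redistribute the result to all ranks.
    return np.sum(partial_forces, axis=0)
```

With a toy force model such as `lambda x: -x`, the summed result equals the forces a single rank would compute on the whole system, which is the invariant the aggregate-and-redistribute step must preserve.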
Why it matters
This integration lets GROMACS users run large-scale molecular dynamics with AI-driven potentials, bridging the gap between quantum accuracy and classical MD throughput. It opens a practical path to simulating complex biomolecular systems with near-ab initio fidelity at scale.
Original Abstract
GROMACS is a de-facto standard for classical Molecular Dynamics (MD). The rise of AI-driven interatomic potentials that pursue near-quantum accuracy at MD throughput now poses a significant challenge: embedding neural-network inference into multi-GPU simulations while retaining high performance. In this work, we integrate the MLIP framework DeePMD-kit into GROMACS, enabling domain-decomposed, GPU-accelerated inference across multi-node systems. We extend the GROMACS NNPot interface with a DeePMD backend, and we introduce a domain decomposition layer decoupled from the main simulation. The inference is executed concurrently on all processes, with two MPI collectives used each step to broadcast coordinates and to aggregate and redistribute forces. We train an in-house DPA-1 model (1.6 M parameters) on a dataset of solvated protein fragments. We validate the implementation on a small protein system, then we benchmark the GROMACS-DeePMD integration with a 15,668-atom protein on NVIDIA A100 and AMD MI250x GPUs up to 32 devices. Strong-scaling efficiency reaches 66% at 16 devices and 40% at 32; weak-scaling efficiency is 80% up to 16 devices and reaches 48% (MI250x) and 40% (A100) at 32 devices. Profiling with the ROCm System profiler shows that >90% of the wall time is spent in DeePMD inference, while MPI collectives contribute <10%, primarily because they act as a global synchronization point. The principal bottlenecks are the irreducible ghost-atom cost set by the cutoff radius, confirmed by a simple throughput model, and load imbalance across ranks. These results demonstrate that production MD with near ab initio fidelity is feasible at scale in GROMACS.
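The abstract attributes the scaling loss to an "irreducible ghost-atom cost set by the cutoff radius, confirmed by a simple throughput model." The paper's own model is not reproduced here, but the idea can be sketched: with a cubic box of side L split over P ranks, each subdomain of side L/P^(1/3) must also hold a ghost shell of thickness equal to the cutoff radius, so per-rank inference work scales with the padded volume rather than the local one. A minimal version of such a model (box size and cutoff values below are illustrative, not the paper's):

```python
def strong_scaling_efficiency(box_side_nm, cutoff_nm, n_ranks):
    """Toy throughput model: efficiency loss from ghost atoms alone.

    Assumes a cubic domain decomposition and inference cost proportional
    to local + ghost atoms, i.e. to (subdomain side + 2 * cutoff)^3.
    """
    # Per-rank subdomain side after splitting the box over n_ranks.
    sub_side = box_side_nm / n_ranks ** (1.0 / 3.0)
    # Single-rank step cost vs. n_ranks * per-rank padded cost.
    t_serial = (box_side_nm + 2.0 * cutoff_nm) ** 3
    t_parallel_total = n_ranks * (sub_side + 2.0 * cutoff_nm) ** 3
    return t_serial / t_parallel_total

# Illustrative numbers: a ~6 nm box with a 0.6 nm cutoff (assumed values).
for p in (1, 8, 27, 64):
    print(p, round(strong_scaling_efficiency(6.0, 0.6, p), 3))
```

Even this crude model reproduces the qualitative trend in the reported results: efficiency is bounded by the growing ghost-to-local volume ratio as subdomains shrink, independent of communication cost.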