ArXiv TLDR

Evaluating the Progression of Large Language Model Capabilities for Small-Molecule Drug Design

arXiv: 2604.16279

Shriram Chennakesavalu, Kirill Shmilovich, Hayley Weir, Colin Grambow, John Bradshaw + 3 more

cs.LG · physics.chem-ph

TLDR

This paper evaluates LLM capabilities for small-molecule drug design by formulating chemically grounded tasks as RL environments, and shows that RL-based post-training significantly improves performance.

Key contributions

  • Introduced new chemically-grounded tasks for LLM evaluation in small-molecule drug design.
  • Formulated these tasks as reinforcement learning (RL) environments for unified assessment and post-training.
  • Found frontier LLMs are increasingly proficient at chemical tasks but still fall short, especially in low-data experimental settings.
  • Showed RL-based post-training substantially boosts performance, making a smaller post-trained model competitive with state-of-the-art frontier models.

Why it matters

This work provides a practical framework for both evaluating and enhancing LLMs in drug discovery: carefully designed evaluation tasks expose capability gaps, and targeted post-training closes them, paving the way for more effective use of LLMs in real-world drug-design workflows.

Original Abstract

Large Language Models (LLMs) have the potential to accelerate small molecule drug design due to their ability to reason about information from diverse sources and formats. However, their practical utility remains unclear due to the lack of benchmarks that reflect real-world scenarios. In this work, we introduce a suite of chemically-grounded tasks spanning molecular property prediction, molecular representation transformations, and molecular design. Importantly, we formulate these tasks as reinforcement learning (RL) environments, enabling a unified approach for evaluation and post-training. Across three model families, we find that frontier models are increasingly proficient at chemical tasks, but that there is significant room for improvement, especially in experimental settings with low data. Critically, we show that RL-based post-training can substantially improve performance. A smaller model post-trained on our environments becomes competitive with state-of-the-art frontier models, despite a significantly weaker base model. This suggests a practical route toward employing LLMs in drug discovery; by combining carefully-designed evaluation tasks with targeted post-training, we can both elucidate and close critical capability gaps.
