AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

May 11, 2026 | arXiv:2605.10876

Edward De Brouwer, Carl Edwards, Alexander Wu, Jenna Collier, Graham Heimberg + 7 more

cs.LG, cs.AI, q-bio.QM

TLDR

AssayBench is a new benchmark for phenotypic screen prediction in virtual cell models, evaluating LLMs and agents on diverse cellular phenotypes.

Key contributions

Introduces AssayBench, a new benchmark for phenotypic screen prediction in virtual cell models.
Comprises 1,920 public CRISPR screens covering five broad classes of cellular phenotypes.
Defines screen prediction as gene rank prediction and introduces the adjusted nDCG metric.
Shows that zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines, and that fine-tuning, ensembling, and prompt optimization improve LLM performance further.
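
To make the rank-prediction framing concrete, here is a minimal Python sketch of scoring one screen with nDCG. The paper's exact definition of the adjusted nDCG is not given in this summary, so the adjustment below is an assumption for illustration only: a chance correction that rescales standard nDCG so that a random ranking scores about 0 and a perfect ranking scores 1. The function names, gene symbols, and relevance values are all hypothetical.

```python
import math
import random


def dcg(relevances):
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(ranked_genes, relevance, k=None):
    """Standard nDCG of a predicted gene ranking against per-gene relevance."""
    if k is None:
        k = len(ranked_genes)
    # Genes missing from the relevance table count as irrelevant.
    gains = [relevance.get(gene, 0.0) for gene in ranked_genes[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def chance_adjusted_ndcg(ranked_genes, relevance, k=None, n_shuffles=200, seed=0):
    """ASSUMED adjustment (not necessarily the paper's): subtract the mean
    nDCG of random rankings and rescale so a perfect ranking still scores 1."""
    rng = random.Random(seed)
    genes = list(relevance)
    baseline = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(genes)
        baseline += ndcg(genes, relevance, k)
    baseline /= n_shuffles
    if baseline >= 1.0:
        return 0.0  # every ranking is already perfect; nothing to adjust
    return (ndcg(ranked_genes, relevance, k) - baseline) / (1.0 - baseline)


# Hypothetical screen: larger values mean a stronger phenotypic effect.
relevance = {"TP53": 3.0, "MYC": 2.0, "KRAS": 1.0, "GAPDH": 0.0}
perfect = ["TP53", "MYC", "KRAS", "GAPDH"]
reverse = list(reversed(perfect))

print(ndcg(perfect, relevance), chance_adjusted_ndcg(perfect, relevance))
print(ndcg(reverse, relevance), chance_adjusted_ndcg(reverse, relevance))
```

A correction of this kind is one plausible way to compare scores across heterogeneous assays, because the nDCG of a random ranking depends on how many genes a screen covers and how its relevance values are distributed.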

Why it matters

This paper fills a critical gap by providing the first standard benchmark for in silico phenotypic screening, a key task for drug discovery and virtual cell models. It enables robust evaluation of LLMs and agentic systems, accelerating progress toward computational models of cellular behavior.

Original Abstract

Recent advances in machine learning and large-scale biological data collections have revived the prospect of building a virtual cell, a computational model of cellular behavior that could accelerate biological discovery. One of the most compelling promises of this vision is the ability to perform in silico phenotypic screens, in which a model predicts the effects of cellular perturbations in unseen biological contexts. This task combines heterogeneous textual inputs with diverse phenotypic outputs, making it particularly well-suited to LLMs and agentic systems. Yet, no standard benchmark currently exists for this task, as existing efforts focus on narrower molecular readouts that are only indirectly aligned with the phenotypic endpoints driving many real-world drug discovery workflows. In this work, we present AssayBench, a benchmark for phenotypic screen prediction, built from 1,920 publicly available CRISPR screens spanning five broad classes of cellular phenotypes. We formulate the screen prediction task as a gene rank prediction for each screen and introduce the adjusted nDCG, a continuous metric for comparing performance across heterogeneous assays. Our extensive evaluation shows that existing methods remain far from empirically estimated performance ceilings and zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines. Optimization techniques such as fine-tuning, ensembling, and prompt optimization can further improve LLM performance on this task. Overall, AssayBench offers a practical testbed for measuring progress toward in silico phenotypic screening and, more broadly, virtual cell models.
