VibeProteinBench: An Evaluation Benchmark for Language-interfaced Vibe Protein Design
Hyunjin Seo, Hongjoon Ahn, Jimin Park, Sungjun Han, Gyubok Lee, and 15 others
TLDR
VibeProteinBench is a new language-interfaced benchmark that evaluates whether LLMs can act as generalist protein designers across three stages: recognition, engineering, and generation.
Key contributions
- Introduces VibeProteinBench, a language-interfaced benchmark for generalist protein design.
- Evaluates LLMs across three key protein design stages: recognition, engineering, and generation.
- Incorporates expert-curated rationales and in silico validation for biological plausibility.
- Finds that no evaluated LLM achieves strong performance across all three stages.
Why it matters
Existing protein design benchmarks are often narrow in scope or restricted to structured input schemas. VibeProteinBench provides an integrated framework for assessing LLMs' broad protein design competence under open-ended natural-language intents, exposing current limitations and pointing to directions for future research.
Original Abstract
Protein design aims to compose amino-acid sequences that fold into stable three-dimensional structures while satisfying targeted functional properties. The field is increasingly shifting toward vibe protein design, where a single model is expected to generate novel sequences, engineer existing proteins, and reason about protein characteristics through flexible natural-language constraints. Large language models (LLMs) have emerged as a leading paradigm in this space. However, existing evaluation benchmarks often limit their scope to a partial aspect of protein design, while others restrict design objectives to structured input schemas, lacking an integrated framework that evaluates the broad spectrum of protein design competence under open-ended intents. To this end, we present Vibe Protein design Benchmark (VibeProteinBench), a language-interfaced benchmark that probes generalist capabilities through three complementary stages mirroring a computational protein design workflow: recognition, engineering, and generation. Each stage is grounded in expert-curated mechanistic rationales and multi-faceted in silico validation, to computationally verify whether model outputs are biologically plausible. Evaluations across diverse general-purpose and domain-specialized LLMs reveal that no model achieves strong performance across all three stages, suggesting that generalist protein design remains a substantial open challenge for current LLMs.
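To make the three-stage setup concrete, here is a minimal sketch of what an evaluation loop in the spirit of VibeProteinBench might look like. This is an illustrative assumption, not the benchmark's actual API: the names `StageTask`, `run_model`, and `validate_in_silico` are hypothetical, and the validation step stands in for the multi-faceted in silico checks (e.g., structure prediction or property scoring) described in the abstract.

```python
# Hypothetical three-stage evaluation loop inspired by VibeProteinBench.
# All names and task contents are illustrative, not the benchmark's API.
from dataclasses import dataclass


@dataclass
class StageTask:
    stage: str      # "recognition" | "engineering" | "generation"
    prompt: str     # open-ended natural-language design intent
    reference: str  # expert-curated rationale or target property


def run_model(prompt: str) -> str:
    """Stand-in for an LLM call; returns a candidate answer or sequence."""
    return "MKT..."  # placeholder output


def validate_in_silico(stage: str, output: str, reference: str) -> float:
    """Stand-in for multi-faceted in silico validation; returns a
    biological-plausibility score in [0, 1]."""
    return 0.0  # placeholder score


def evaluate(tasks: list[StageTask]) -> dict[str, float]:
    """Average validation score per stage, so per-stage strengths and
    weaknesses stay visible rather than being collapsed into one number."""
    per_stage: dict[str, list[float]] = {}
    for task in tasks:
        score = validate_in_silico(task.stage, run_model(task.prompt), task.reference)
        per_stage.setdefault(task.stage, []).append(score)
    return {stage: sum(scores) / len(scores) for stage, scores in per_stage.items()}


if __name__ == "__main__":
    tasks = [
        StageTask("recognition", "Which residue coordinates zinc in this sequence?", "H94"),
        StageTask("engineering", "Increase the thermostability of this enzyme.", "stability rationale"),
        StageTask("generation", "Design a short helical binder for a given target.", "binding rationale"),
    ]
    print(evaluate(tasks))
```

Reporting scores per stage, rather than a single aggregate, matches the paper's central finding: models can look competent overall while still failing to be strong across recognition, engineering, and generation simultaneously.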