
VibeProteinBench: An Evaluation Benchmark for Language-interfaced Vibe Protein Design

2605.10978

Hyunjin Seo, Hongjoon Ahn, Jimin Park, Sungjun Han, Gyubok Lee + 15 more

q-bio.QM

TLDR

VibeProteinBench is a new language-interfaced benchmark that evaluates LLMs on generalist protein design across three stages: recognition, engineering, and generation.

Key contributions

  • Introduces VibeProteinBench, a language-interfaced benchmark for generalist protein design.
  • Evaluates LLMs across three key protein design stages: recognition, engineering, and generation.
  • Incorporates expert-curated rationales and in silico validation for biological plausibility.
  • Shows that no evaluated LLM, general-purpose or domain-specialized, achieves strong performance across all three stages.

Why it matters

Existing protein design benchmarks tend either to cover only part of the design workflow or to restrict design objectives to structured input schemas. VibeProteinBench offers an integrated framework for assessing LLMs on protein design under open-ended, natural-language intents, exposing current limitations and indicating where future research is needed.

Original Abstract

Protein design aims to compose amino-acid sequences that fold into stable three-dimensional structures while satisfying targeted functional properties. The field is increasingly shifting toward vibe protein design, where a single model is expected to generate novel sequences, engineer existing proteins, and reason about protein characteristics through flexible natural-language constraints. Large language models (LLMs) have emerged as a leading paradigm in this space. However, existing evaluation benchmarks often limit their scope to a partial aspect of protein design, while others restrict design objectives to structured input schemas, lacking an integrated framework that evaluates the broad spectrum of protein design competence under open-ended intents. To this end, we present Vibe Protein design Benchmark (VibeProteinBench), a language-interfaced benchmark that probes generalist capabilities through three complementary stages mirroring a computational protein design workflow: recognition, engineering, and generation. Each stage is grounded in expert-curated mechanistic rationales and multi-faceted in silico validation, to computationally verify whether model outputs are biologically plausible. Evaluations across diverse general-purpose and domain-specialized LLMs reveal that no model achieves strong performance across all three stages, suggesting that generalist protein design remains a substantial open challenge for current LLMs.
