From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs
Itay Itzhak, Eliya Habba, Gabriel Stanovsky, Yonatan Belinkov
TLDR
This paper formalizes user "vibe-testing" of LLMs and builds a proof-of-concept pipeline that personalizes both the prompts and the judging criteria, reflecting real-world usefulness better than standard benchmark scores.
Key contributions
- Analyzed user LLM evaluation practices through surveys and in-the-wild comparison reports.
- Formalized "vibe-testing" as personalizing both what is tested and how responses are judged.
- Developed a proof-of-concept pipeline for personalized LLM evaluation with user-aware criteria.
- Demonstrated that combining personalized prompts with user-aware evaluation can change which model is preferred, mirroring how vibe-testing plays out in practice.
Why it matters
Current LLM benchmarks often fail to capture real-world usefulness, so users fall back on informal, hard-to-reproduce vibe checks. By formalizing user-centric "vibe-testing", this paper offers a systematic, reproducible way to evaluate models on what individual users actually care about, bridging the gap between academic metrics and practical utility.
Original Abstract
Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on "vibe-testing": informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.
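To make the two-part formulation concrete, here is a minimal Python sketch of a pipeline in this spirit: step 1 personalizes what is tested, step 2 personalizes how responses are judged. It is illustrative only; `query_model`, the model names, the persona, and the criteria are hypothetical stand-ins, not the authors' actual implementation.

```python
def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call; wire to your provider."""
    raise NotImplementedError

def personalize_prompt(seed_task: str, persona: str) -> str:
    """Step 1: rewrite a generic benchmark task to match a user's own workflow."""
    return query_model(
        "rewriter-llm",  # assumed rewriter model
        f"Rewrite this task as {persona} would phrase it:\n{seed_task}",
    )

def judge_with_user_criteria(prompt: str, answer_a: str, answer_b: str,
                             criteria: list[str]) -> str:
    """Step 2: compare two model outputs on user-aware subjective criteria."""
    rubric = "\n".join(f"- {c}" for c in criteria)
    return query_model(
        "judge-llm",  # assumed judge model
        f"Prompt: {prompt}\n\nResponse A:\n{answer_a}\n\n"
        f"Response B:\n{answer_b}\n\n"
        f"Judge using these criteria:\n{rubric}\nReply with 'A' or 'B'.",
    ).strip()

# Example use with an invented persona and criteria list:
task = personalize_prompt(
    "Write a function that removes duplicates from a list.",
    persona="a data engineer who prefers typed, well-commented Python",
)
winner = judge_with_user_criteria(
    task,
    answer_a=query_model("model-a", task),
    answer_b=query_model("model-b", task),
    criteria=["matches my coding style", "concise", "explains tradeoffs"],
)
```

Because both the prompt rewriting and the rubric depend on the persona, two different users can end up preferring different models on the same seed tasks, which is the effect the paper reports on coding benchmarks.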