ArXiv TLDR

Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

arXiv: 2604.22517

Wataru Hirota, Tomoki Taniguchi, Tomoko Ohkuma, Kosuke Takahashi, Takahiro Omi + 4 more

cs.CL

TLDR

Personalized judges align more closely with individual experts than aggregate judges when evaluating business ideas, as shown on PBIG-DATA, a new dataset of expert scores.

Key contributions

  • Introduces PBIG-DATA, a dataset of approximately 3,000 expert scores for 300 patent-grounded business ideas across six dimensions.
  • Reveals substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, indicating structured heterogeneity rather than random noise.
  • Compares rubric-only, aggregate, and personalized judge configurations for business idea evaluation.
  • Demonstrates personalized judges align more closely with individual experts than aggregate judges.

Why it matters

Evaluating LLM-generated business ideas is challenging because the criteria are multi-dimensional and experts often disagree. This research offers a methodological insight: personalized AI judges, conditioned on an individual expert's scoring history, align more closely with that expert than aggregate models trained on pooled labels. This approach could improve both the reliability and the scalability of automated business idea assessment.

Original Abstract

Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured heterogeneity rather than random noise. We then compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator's scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning. These results indicate that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.
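The abstract's core comparison is how closely a judge's scores track a single evaluator's ordinal scores, for an aggregate judge versus a personalized one. A minimal sketch of that alignment measurement, using Spearman rank correlation on hypothetical 1-5 scores (the data and judge outputs below are illustrative, not from PBIG-DATA):

```python
def ranks(xs):
    """Average ranks with ties, 1-indexed (as used by Spearman correlation)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-indexed
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman rho = Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Hypothetical scores from one evaluator on ten ideas, and from two
# judge configurations (values invented for illustration).
evaluator          = [5, 3, 4, 2, 5, 1, 4, 3, 2, 4]
aggregate_judge    = [4, 4, 4, 3, 4, 3, 4, 3, 3, 4]  # regresses toward the pooled mean
personalized_judge = [5, 3, 4, 2, 4, 2, 4, 3, 2, 4]  # tracks this evaluator's history

print(f"aggregate rho    = {spearman(evaluator, aggregate_judge):.2f}")
print(f"personalized rho = {spearman(evaluator, personalized_judge):.2f}")
```

The point of the toy numbers: a judge trained on pooled labels tends to compress scores toward the consensus mean, flattening exactly the fine-grained distinctions on which the paper reports expert disagreement, so its rank correlation with any one evaluator is lower than that of an evaluator-conditioned judge.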
