ArXiv TLDR

The Ideation Bottleneck: Decomposing the Quality Gap Between AI-Generated and Human Economics Research

2604.03338

Ning Li

econ.GN, cs.AI, cs.CY

TLDR

AI-generated economics research lags human work mainly because of weaker research ideas: idea quality accounts for roughly 71% of the quality gap, and execution for the remaining 29%.

Key contributions

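  • Across 953 economics papers (912 AI-generated papers from the APE project, 41 human papers from the American Economic Review and AEJ: Economic Policy), AI papers substantially underperform human papers.
  • The quality gap is decomposed into two components: idea quality accounts for approximately 71% and execution quality for 29%.
  • Human papers show much higher idea quality: 47.1% vs. 16.5% mean ensemble exceptional probability (Cohen's d = 2.23).
  • The execution gap is smaller (4.38/5.0 vs. 3.84, d = 0.90). AI is weakest in mechanism analysis depth (d = 1.43), while no significant difference is found on robustness.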

Why it matters

This paper quantifies where AI-generated research falls short, showing that ideation, rather than technical execution, is the main hurdle. Only 7 of the 912 AI papers (0.8%) surpass the median human paper on both idea and execution quality, which suggests that efforts to improve AI research systems should focus on idea generation.

Original Abstract

Autonomous AI systems can now generate complete economics research papers, but they substantially underperform human-authored publications in head-to-head comparisons. This paper decomposes the quality gap into two independent components: research idea quality and execution quality. Using a two-model ensemble of fine-tuned language models trained on publication decisions (Gong, Li, and Zhou, 2026) to evaluate idea quality and a comprehensive six-dimension rubric assessed by Gemini 3.1 Flash Lite -- the same model family used as the APE tournament judge, ensuring methodological consistency -- to evaluate execution quality, we analyze 953 economics papers -- 912 AI-generated papers from the APE project and 41 human papers published in the American Economic Review and AEJ: Economic Policy. The idea quality gap is large (Cohen's d = 2.23, p < 0.001), with human papers achieving 47.1% mean ensemble exceptional probability versus 16.5% for AI. The execution quality gap is also significant but smaller (d = 0.90, p < 0.001), with human papers scoring 4.38/5.0 versus 3.84. Idea quality accounts for approximately 71% of the overall quality difference, with execution contributing 29%. The largest execution weakness is mechanism analysis depth (d = 1.43); no significant difference is found on robustness. We document that 74% of AI papers employ difference-in-differences, and only 7 AI papers (0.8%) surpass the median human paper on both idea and execution quality simultaneously. The primary bottleneck to competitive AI-generated economics research remains ideation.
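The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check, not the paper's procedure: the abstract does not say how the 71%/29% split is computed, so the proportional-to-Cohen's-d rule in `effect_size_shares` (a hypothetical helper name) is an assumption that happens to reproduce the rounded figures.

```python
# Figures quoted in the abstract.
D_IDEA = 2.23          # Cohen's d, idea quality gap
D_EXECUTION = 0.90     # Cohen's d, execution quality gap
N_AI = 912             # AI-generated papers (APE project)
N_HUMAN = 41           # human papers (AER and AEJ: Economic Policy)
N_AI_ABOVE_MEDIAN = 7  # AI papers above the median human paper on both axes


def effect_size_shares(d_idea: float, d_execution: float) -> tuple[float, float]:
    """Split the gap in proportion to the two effect sizes.

    ASSUMPTION: this rule is not stated in the abstract; it is one simple
    reading that matches the reported 71% / 29% after rounding.
    """
    total = d_idea + d_execution
    return d_idea / total, d_execution / total


idea_share, execution_share = effect_size_shares(D_IDEA, D_EXECUTION)
total_papers = N_AI + N_HUMAN
pct_above_median = 100 * N_AI_ABOVE_MEDIAN / N_AI

print(f"idea share:      {idea_share:.1%}")        # 71.2%
print(f"execution share: {execution_share:.1%}")   # 28.8%
print(f"total papers:    {total_papers}")          # 953
print(f"AI above median: {pct_above_median:.1f}%")  # 0.8%
```

Under this assumed rule the shares come out at 71.2% and 28.8%, matching the reported 71% and 29%; the sample size and the 0.8% figure follow directly from the counts in the abstract.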
