An Empirical Study of Proactive Coding Assistants in Real-World Software Development
Lehui Li, Ruixuan Jia, Guo-Ye Yang, Jia Li
TLDR
This study reveals a significant simulation-to-reality gap in proactive coding assistant research, introducing ProCodeBench, a real-world benchmark for intent prediction.
Key contributions
- Collected real IDE interaction data from 1,246 industry developers over three consecutive days via a custom Visual Studio Code extension (sketched after this list).
- Revealed substantial differences between LLM-simulated and real developer traces in behavior and patterns.
- Introduced ProCodeBench, a new real-world benchmark for proactive intent prediction.
- Demonstrated that evaluation on simulated data overestimates the real-world performance of current LLMs.
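The data collection above relies on a custom VS Code extension that records IDE interactions. As a rough illustration of what such telemetry could look like, here is a minimal TypeScript sketch built on the standard VS Code extension API; the event schema, the `writeTrace` helper, and the choice of which events to log are hypothetical assumptions, not the authors' actual implementation.

```typescript
import * as vscode from 'vscode';
import * as fs from 'fs';
import * as path from 'path';

// Hypothetical trace record; the paper's actual schema is not public.
interface TraceEvent {
  timestamp: number; // epoch milliseconds
  kind: string;      // e.g. "edit", "selection", "fileOpen"
  file: string;      // path relative to the workspace root
  detail?: string;   // event-specific payload
}

// Append one event as a JSON line to a local trace file (hypothetical helper).
function writeTrace(logPath: string, event: TraceEvent): void {
  fs.appendFileSync(logPath, JSON.stringify(event) + '\n');
}

export function activate(context: vscode.ExtensionContext): void {
  const logPath = path.join(context.globalStorageUri.fsPath, 'trace.jsonl');
  fs.mkdirSync(path.dirname(logPath), { recursive: true });

  // Record text edits as they happen.
  context.subscriptions.push(
    vscode.workspace.onDidChangeTextDocument((e) => {
      writeTrace(logPath, {
        timestamp: Date.now(),
        kind: 'edit',
        file: vscode.workspace.asRelativePath(e.document.uri),
        detail: `${e.contentChanges.length} change(s)`,
      });
    })
  );

  // Record cursor/selection movement, one signal of code exploration.
  context.subscriptions.push(
    vscode.window.onDidChangeTextEditorSelection((e) => {
      writeTrace(logPath, {
        timestamp: Date.now(),
        kind: 'selection',
        file: vscode.workspace.asRelativePath(e.textEditor.document.uri),
      });
    })
  );

  // Record file opens, capturing navigation between files.
  context.subscriptions.push(
    vscode.workspace.onDidOpenTextDocument((doc) => {
      writeTrace(logPath, {
        timestamp: Date.now(),
        kind: 'fileOpen',
        file: vscode.workspace.asRelativePath(doc.uri),
      });
    })
  );
}

export function deactivate(): void {}
```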
Why it matters
By showing that LLM-simulated IDE traces misrepresent real developer behavior, this paper exposes a blind spot in proactive coding assistant research. ProCodeBench gives the field a real-world benchmark for intent prediction, and the training study shows that simulated data cannot replace real developer data, though it can complement it when used before real-world fine-tuning.
Original Abstract
Large language model (LLM)-based coding assistants have made substantial progress, yet most systems remain reactive, requiring developers to explicitly formulate their needs. Proactive coding assistants aim to infer latent developer intent from integrated development environment (IDE) interactions and repository context, thereby reducing interaction overhead and supporting more seamless assistance. However, research in this direction is limited by the scarcity of large-scale real-world developer behavior data. Existing studies therefore often rely on LLM-simulated IDE traces, whose fidelity to real development behavior remains unclear. In this paper, we investigate this simulation-to-reality gap through a large-scale empirical study. We collect real IDE interaction traces from 1,246 experienced industry developers over three consecutive days using a custom Visual Studio Code extension, and construct paired LLM-simulated traces for controlled comparison. Our analysis shows that simulated traces differ substantially from real traces in behavioral diversity, temporal structure, and exploratory patterns. Based on the collected data, we introduce ProCodeBench, a real-world benchmark for proactive intent prediction. Experiments with representative LLMs, retrieval-augmented methods, and agentic baselines show that current approaches remain far from reliable under real IDE traces, suggesting that simulation-based evaluation can overestimate real-world performance. Finally, our training study shows that simulated data cannot replace real data, but can complement it when used before real-world fine-tuning. These findings highlight the importance of real developer behavior data for evaluating and training proactive coding assistants.
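The abstract's comparison axes suggest simple per-trace statistics. Below is a minimal TypeScript sketch, assuming JSONL traces shaped like the extension sketch above; the two metrics here, action-type entropy as a proxy for behavioral diversity and inter-event gaps as a proxy for temporal structure, are illustrative stand-ins and not the paper's actual analysis.

```typescript
interface TraceEvent {
  timestamp: number; // epoch milliseconds
  kind: string;      // action type, e.g. "edit", "selection", "fileOpen"
}

// Shannon entropy over action types: one crude proxy for behavioral diversity.
function actionEntropy(events: TraceEvent[]): number {
  const counts = new Map<string, number>();
  for (const e of events) {
    counts.set(e.kind, (counts.get(e.kind) ?? 0) + 1);
  }
  let entropy = 0;
  for (const c of counts.values()) {
    const p = c / events.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// Gaps between consecutive events in seconds: a crude proxy for temporal structure.
function interEventGaps(events: TraceEvent[]): number[] {
  const sorted = [...events].sort((a, b) => a.timestamp - b.timestamp);
  const gaps: number[] = [];
  for (let i = 1; i < sorted.length; i++) {
    gaps.push((sorted[i].timestamp - sorted[i - 1].timestamp) / 1000);
  }
  return gaps;
}

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Compare a real trace against its paired LLM-simulated trace.
function compareTraces(real: TraceEvent[], simulated: TraceEvent[]): void {
  console.log(`action entropy  real=${actionEntropy(real).toFixed(2)}  sim=${actionEntropy(simulated).toFixed(2)}`);
  console.log(`median gap (s)  real=${median(interEventGaps(real)).toFixed(1)}  sim=${median(interEventGaps(simulated)).toFixed(1)}`);
}
```

On paired traces, lower entropy or more uniform gaps on the simulated side would be consistent with the paper's finding that simulated traces lack the diversity and temporal irregularity of real developer behavior.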