How Much LLM Does a Self-Revising Agent Actually Need?
TLDR
This paper externalizes an LLM agent's reflection into inspectable runtime structure so that the marginal contribution of the LLM to planning and revision can be measured empirically.
Key contributions
- Introduced a "declared reflective runtime protocol" that externalizes agent state, confidence signals, guarded actions, and hypothetical transitions (a hypothetical code sketch follows this list).
- Decomposed agent competence into belief tracking, world-model planning, symbolic reflection, and sparse LLM revision.
- Found that explicit world-model planning substantially improves win rate (+24.1pp) and F1 (+0.017) over a greedy posterior-following baseline.
- Showed that sparse LLM revision (about 4.3% of turns) had a small, non-monotonic impact: average F1 rose slightly (+0.005) while win rate dropped (31→29 out of 54).
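The paper's runtime interface is not reproduced here, but the protocol in the first bullet can be pictured as plain data structures the agent must read and write instead of keeping reflection latent. The sketch below is a minimal Python illustration; every name in it (`ReflectiveState`, `GuardedAction`, `HypotheticalTransition`, `prediction_errors`) is an assumption, not the authors' actual API.

```python
# Hypothetical sketch of a "declared reflective runtime protocol": the
# agent's normally-latent reflection is externalized as plain, inspectable
# data. All names here are illustrative, not the paper's interface.
from dataclasses import dataclass, field

@dataclass
class HypotheticalTransition:
    """A 'what if' step the agent considered but has not committed to."""
    action: str                 # e.g. "fire at (3, 5)"
    predicted_outcome: str      # e.g. "hit"
    confidence: float           # posterior probability assigned to the outcome

@dataclass
class GuardedAction:
    """An action that only executes if its confidence guard is satisfied."""
    action: str
    min_confidence: float       # guard threshold
    confidence: float

    def permitted(self) -> bool:
        return self.confidence >= self.min_confidence

@dataclass
class ReflectiveState:
    """Externalized agent state: everything reflection reads or writes."""
    beliefs: dict[str, float] = field(default_factory=dict)   # cell -> P(ship)
    predictions: list[HypotheticalTransition] = field(default_factory=list)
    pending: list[GuardedAction] = field(default_factory=list)

    def prediction_errors(self, observed: dict[str, str]) -> list[HypotheticalTransition]:
        """Surface mispredictions so a revision step (symbolic or LLM) can react."""
        return [p for p in self.predictions
                if observed.get(p.action) not in (None, p.predicted_outcome)]
```

Externalizing state this way is what makes the marginal-LLM question answerable: every prediction, guard, and revision trigger is inspectable runtime data rather than hidden chain-of-thought.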
Why it matters
This paper provides a framework for empirically disentangling the LLM's contribution from the explicit structural components of a self-revising agent. It shows that structured planning is highly effective while sparse LLM revision offers limited marginal benefit, a finding that can guide the design of more efficient and interpretable LLM-based agents.
Original Abstract
Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop. This can produce capable behavior, but it makes a basic scientific question difficult to answer: which part of the agent's competence actually comes from the LLM, and which part comes from explicit structure around it? We study this question not by claiming a general answer, but by making it empirically tractable. We introduce a declared reflective runtime protocol that externalizes agent state, confidence signals, guarded actions, and hypothetical transitions into inspectable runtime structure. We instantiate this protocol in a declarative runtime and evaluate it on noisy Collaborative Battleship [4] using four progressively structured agents over 54 games (18 boards $\times$ 3 seeds). The resulting decomposition isolates four components: posterior belief tracking, explicit world-model planning, symbolic in-episode reflection, and sparse LLM-based revision. Across this decomposition, explicit world-model planning improves substantially over a greedy posterior-following baseline (+24.1pp win rate, +0.017 F1). Symbolic reflection operates as a real runtime mechanism -- with prediction tracking, confidence gating, and guarded revision actions -- even though its current revision presets are not yet net-positive in aggregate. Adding conditional LLM revision at about 4.3% of turns yields only a small and non-monotonic change: average F1 rises slightly (+0.005) while win rate drops (31$\rightarrow$29 out of 54). These results suggest a methodological contribution rather than a leaderboard claim: externalizing reflection turns otherwise latent agent behavior into inspectable runtime structure, allowing the marginal role of LLM intervention to be studied directly.
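The abstract's "conditional LLM revision at about 4.3% of turns" implies a trigger that fires rarely; confidence gating over tracked predictions is one plausible mechanism, sketched below. The gate threshold, the trigger rule, and the `revise_with_llm` callback are assumptions for illustration, not the paper's specified mechanism.

```python
# Sketch of sparse, confidence-gated LLM revision, reusing the hypothetical
# ReflectiveState above. The gating rule and threshold are assumptions.
def maybe_revise(state, observed, revise_with_llm, gate=0.8):
    """Invoke the LLM only when a high-confidence prediction turned out wrong."""
    errors = state.prediction_errors(observed)
    # Low-confidence mispredictions are left to symbolic reflection; only a
    # confident surprise escalates to the LLM, keeping calls sparse.
    surprising = [e for e in errors if e.confidence >= gate]
    if not surprising:
        return state  # the common case: no LLM call this turn
    return revise_with_llm(state, surprising)
```

Under a gate like this, whether the rare LLM call helps or hurts becomes directly measurable, which is exactly the marginal comparison the paper's decomposition enables (win rate 31→29 despite +0.005 F1 in their runs).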