From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

April 16, 2026 · arXiv:2604.15097

cs.SE · cs.CL

TLDR

This paper introduces a compact "Gene" representation for reusable experience that outperforms documentation-oriented "Skill" packages, both as test-time control and as a substrate for iterative evolution.

Key contributions

Documentation-oriented "Skill" packages provide unstable control for experience reuse.
A compact "Gene" representation yields stronger, more stable control for experience reuse.
"Gene" is a better carrier for iterative experience accumulation and evolution than "Skill".
On CritPt, gene-evolved systems improve over their paired base models from 9.1% to 18.57% and from 17.7% to 27.14%.

Why it matters

This paper reframes the problem of experience reuse, shifting the focus from supplying more experience to encoding it effectively. The proposed "Gene" representation is a compact, control-oriented, evolution-ready object; on CritPt, gene-evolved systems roughly double one base model's score (9.1% to 18.57%) and raise another's from 17.7% to 27.14%. The results bear on how AI systems can reuse and accumulate experience from past interactions.

Original Abstract

This beta technical report asks how reusable experience should be represented so that it can function as effective test-time control and as a substrate for iterative evolution. We study this question in 4,590 controlled trials across 45 scientific code-solving scenarios. We find that documentation-oriented Skill packages provide unstable control: their useful signal is sparse, and expanding a compact experience object into a fuller documentation package often fails to help and can degrade the overall average. We further show that representation itself is a first-order factor. A compact Gene representation yields the strongest overall average, remains competitive under substantial structural perturbations, and outperforms matched-budget Skill fragments, while reattaching documentation-oriented material usually weakens rather than improves it. Beyond one-shot control, we show that Gene is also a better carrier for iterative experience accumulation: attached failure history is more effective in Gene than in Skill or freeform text, editable structure matters beyond content alone, and failure information is most useful when distilled into compact warnings rather than naively appended. On CritPt, gene-evolved systems improve over their paired base models from 9.1% to 18.57% and from 17.7% to 27.14%. These results suggest that the core problem in experience reuse is not how to supply more experience, but how to encode experience as a compact, control-oriented, evolution-ready object.
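The two CritPt comparisons in the abstract are easier to compare as gains. The sketch below is only arithmetic on the reported scores; the function name `gain` and the choice to report both percentage points and relative change are illustrative assumptions, not something the report defines.

```python
def gain(base: float, evolved: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent).

    Hypothetical helper for illustration; not part of the report.
    """
    absolute = evolved - base            # difference in percentage points
    relative = absolute / base * 100.0   # change relative to the base score
    return round(absolute, 2), round(relative, 2)


# (base score %, gene-evolved score %) as quoted in the abstract.
REPORTED = [(9.1, 18.57), (17.7, 27.14)]

for base, evolved in REPORTED:
    abs_pts, rel_pct = gain(base, evolved)
    print(f"{base}% -> {evolved}%: +{abs_pts} points ({rel_pct}% relative)")
```

Both pairs gain roughly 9.5 percentage points, but the relative improvement is much larger for the weaker base model, which is worth keeping in mind when reading the headline numbers.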
