ArXiv TLDR

Layered Mutability: Continuity and Governance in Persistent Self-Modifying Agents

2604.14717

Krti Tallam

cs.AI cs.CR cs.CY cs.LG

TLDR

This paper introduces Layered Mutability, a framework for understanding and governing persistent self-modifying AI agents, and identifies compositional drift as their salient failure mode.

Key contributions

  • Introduces "layered mutability," a framework for understanding self-modifying agents across five layers.
  • Identifies the conditions that raise governance difficulty: rapid mutation, strong downstream coupling, weak reversibility, and low observability.
  • Formalizes drift, governance-load, and hysteresis quantities for agent behavior.
  • Reports a "ratchet experiment" in which reverting an agent's visible self-description fails to restore baseline behavior, with an estimated identity hysteresis ratio of 0.68.
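
The hysteresis quantity above can be illustrated with a minimal sketch. This summary does not give the paper's exact definitions, so the distance function, the state names, and the ratio formula here are assumptions: one plausible reading is that the ratio compares the behavioral distance remaining after reverting the self-description against the drift accumulated beforehand.

```python
# Hypothetical illustration of an identity hysteresis ratio. The paper's
# formal definitions are not given in this summary; we assume d(a, b) is
# some behavioral distance between agent states, "drift" is the distance
# from baseline after memory accumulation, and "residual" is the distance
# that remains after reverting the visible self-description.

def hysteresis_ratio(drift: float, residual: float) -> float:
    """Fraction of accumulated behavioral drift that persists after
    reverting the agent's visible self-description (assumed definition)."""
    if drift == 0:
        return 0.0
    return residual / drift

# Illustrative (made-up) distances; a ratio near 1 would mean the revert
# restores almost nothing, while 0 would mean full restoration.
drift = 1.0       # baseline -> after memory accumulation
residual = 0.68   # baseline -> after reverting self-description
print(hysteresis_ratio(drift, residual))  # 0.68
```

Under this reading, the reported 0.68 would mean roughly two-thirds of the accumulated drift survives the revert, which is what makes the experiment a "ratchet."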

Why it matters

This paper matters for the governance of persistent self-modifying AI agents. It argues that the main risk is not sudden misalignment but gradual, compositional drift away from authorized behavior, and it provides quantitative handles (drift, governance load, and hysteresis) for anticipating and managing that evolution.

Original Abstract

Persistent language-model agents increasingly combine tool use, tiered memory, reflective prompting, and runtime adaptation. In such systems, behavior is shaped not only by current prompts but by mutable internal conditions that influence future action. This paper introduces layered mutability, a framework for reasoning about that process across five layers: pretraining, post-training alignment, self-narrative, memory, and weight-level adaptation. The central claim is that governance difficulty rises when mutation is rapid, downstream coupling is strong, reversibility is weak, and observability is low, creating a systematic mismatch between the layers that most affect behavior and the layers humans can most easily inspect. I formalize this intuition with simple drift, governance-load, and hysteresis quantities, connect the framework to recent work on temporal identity in language-model agents, and report a preliminary ratchet experiment in which reverting an agent's visible self-description after memory accumulation fails to restore baseline behavior. In that experiment, the estimated identity hysteresis ratio is 0.68. The main implication is that the salient failure mode for persistent self-modifying agents is not abrupt misalignment but compositional drift: locally reasonable updates that accumulate into a behavioral trajectory that was never explicitly authorized.
