ArXiv TLDR

Learning CLI Agents with Structured Action Credit under Selective Observation

2605.08013

Haoyang Su, Ying Wen

cs.AI

TLDR

This paper introduces methods for training CLI agents with structured, action-level credit assignment under selective observation of the codebase, improving performance on shell-driven information extraction and file-editing tasks.

Key contributions

  • Introduces σ-Reveal, an inference-time mechanism for selecting token-budgeted context for CLI agents.
  • Proposes Action Advantage Assignment (A^3), an RL method for credit assignment using AST-based action sub-chain residuals.
  • Constructs ShellOps, a new verifiable dataset suite for evaluating CLI tasks in repository environments.
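
The token-budgeted selection behind σ-Reveal can be pictured with a minimal sketch. The paper does not spell out its selection rule here, so the greedy relevance-ranked loop, the `select_context` helper, and the snippet scores below are all illustrative assumptions, not the actual mechanism:

```python
# Illustrative sketch (NOT the paper's algorithm): reveal the most
# task-relevant snippets of a codebase while staying under a token budget.
def select_context(snippets, budget):
    """snippets: list of (relevance_score, token_count, text) tuples.
    Returns revealed texts whose total token_count never exceeds budget."""
    chosen, used = [], 0
    for score, tokens, text in sorted(snippets, key=lambda s: -s[0]):
        if used + tokens <= budget:   # skip snippets that would overflow
            chosen.append(text)
            used += tokens
    return chosen

snippets = [
    (0.9, 40, "def parse_args(argv): ..."),  # highly relevant
    (0.7, 80, "class Repo: ..."),            # relevant but large
    (0.2, 50, "# changelog entry ..."),      # low relevance
]
# Budget of 100 tokens: the 80-token snippet no longer fits after the first.
print(select_context(snippets, budget=100))
```

Greedy selection under a knapsack-style budget is just one plausible instantiation; the actual mechanism operates at inference time over live CLI observations.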

Why it matters

This work tackles two key challenges in training CLI agents: identifying task-relevant evidence in a large codebase from partial observations, and assigning sparse terminal rewards to the actions that shape a long multi-turn trajectory. By improving both observation selection and credit assignment, it advances more effective and practical agent-computer interaction.

Original Abstract

Command line interface (CLI) agents are emerging as a practical paradigm for agent-computer interaction over evolving filesystems, executable command line programs, and online execution feedback. Recent work has used reinforcement learning (RL) to learn these interaction abilities from verifiable task feedback, yet few methods exploit the native structured attributes of CLI actions as learning signals. Beyond this underused action structure, CLI learning also couples two bottlenecks for coding agents. First, the agent must identify task-relevant evidence in a large codebase from partial observations. Second, sparse terminal rewards must be assigned to the actions that shape a long multi-turn trajectory. We study these bottlenecks through shell-driven information extraction and file editing tasks. For selective observation, we introduce $σ$-Reveal, an inference-time mechanism that selects token-budgeted context for the same CLI. For credit assignment, we propose Action Advantage Assignment ($\mathrm{A}^3$), a native agentic RL method that preserves the algorithmic complexity of standard agentic RL. $\mathrm{A}^3$ constructs turn-level advantages from episode-level relative feedback, abstract syntax tree (AST) based action sub-chain residuals, and tree-level trajectory margins. To further evaluate this problem setting, we construct ShellOps, a verifiable dataset suite covering CLI tasks in repository environments.
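
The turn-level advantage construction can be sketched minimally, assuming a GRPO-style group normalization of episode rewards. The `turn_advantages` helper and the scalar per-turn residuals below are hypothetical stand-ins for $\mathrm{A}^3$'s AST-based action sub-chain residuals and tree-level trajectory margins, which are not reproduced here:

```python
# Illustrative sketch (NOT the paper's A^3): broadcast a group-normalized
# episode reward to every turn, then add a per-turn residual term.
def turn_advantages(episode_rewards, turn_residuals):
    """episode_rewards: one terminal reward per trajectory in the group.
    turn_residuals: per-trajectory lists of per-turn residual scores
    (scalar stand-ins for AST-based action sub-chain residuals).
    Returns per-trajectory lists of turn-level advantages."""
    n = len(episode_rewards)
    mean = sum(episode_rewards) / n
    std = (sum((r - mean) ** 2 for r in episode_rewards) / n) ** 0.5
    std = std or 1.0  # all rewards equal -> avoid dividing by zero
    advs = []
    for reward, residuals in zip(episode_rewards, turn_residuals):
        base = (reward - mean) / std          # episode-level relative feedback
        advs.append([base + r for r in residuals])
    return advs

# Two trajectories of two turns each: the successful one (reward 1.0)
# gets positive turn advantages, modulated by each turn's residual.
print(turn_advantages([1.0, 0.0], [[0.1, -0.1], [0.0, 0.2]]))
```

Note the claimed property that this shape of computation preserves the algorithmic complexity of standard agentic RL: it is a per-group, per-turn pass with no extra value network.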
