ArXiv TLDR

CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models

2604.22238

Khoa Vo, Sieu Tran, Taisei Hanyu, Yuki Ikebe, Duy Nguyen + 6 more

cs.RO

TLDR

CodeGraphVLP improves VLA models on long-horizon robot tasks by combining a persistent semantic graph, an executable code-based planner, and progress-guided visual-language prompting.

Key contributions

  • Introduces CodeGraphVLP, a hierarchical framework for long-horizon non-Markovian robot manipulation.
  • Uses a persistent semantic graph to maintain task-relevant entities and relations under partial observability.
  • Employs an executable code-based planner for efficient progress checks and subtask instruction generation.
  • Constructs clutter-suppressed observations for the VLA executor using planner outputs.
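The loop implied by these contributions can be sketched in Python. This is an illustrative toy only: the paper does not publish code, so every name here (`SemanticGraph`, `progress_check`, `clutter_suppressed`, the triple-based state) is a hypothetical stand-in for the actual components.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the CodeGraphVLP loop; all APIs are assumptions,
# not the authors' implementation.

@dataclass
class SemanticGraph:
    """Persistent task-relevant state: entities and relations that
    survive occlusion (non-Markovian memory)."""
    entities: dict = field(default_factory=dict)   # name -> attributes
    relations: set = field(default_factory=set)    # (subject, relation, object) triples

    def update(self, detections):
        """Merge new detections; entities seen earlier but now occluded persist."""
        for name, attrs in detections.items():
            self.entities.setdefault(name, {}).update(attrs)

    def holds(self, triple):
        return triple in self.relations

def progress_check(graph, subtasks):
    """Code-based planner step: executable check over the graph that returns
    the first unmet subtask instruction and its relevant objects."""
    for instruction, goal_triple, objects in subtasks:
        if not graph.holds(goal_triple):
            return instruction, objects
    return None, set()

def clutter_suppressed(observation, relevant_objects):
    """Keep only subtask-relevant objects in the observation fed to the VLA executor."""
    return {k: v for k, v in observation.items() if k in relevant_objects}

# Toy run: one stacking subtask under clutter.
graph = SemanticGraph()
graph.update({"red_cube": {"seen": True}, "blue_cube": {"seen": True}})
graph.relations.add(("red_cube", "on", "table"))

subtasks = [
    ("place red_cube on blue_cube",
     ("red_cube", "on", "blue_cube"),
     {"red_cube", "blue_cube"}),
]
instruction, objects = progress_check(graph, subtasks)
obs = {"red_cube": "crop", "blue_cube": "crop", "distractor_mug": "crop"}
focused = clutter_suppressed(obs, objects)   # distractor_mug is filtered out
print(instruction, focused)
```

The key design point the sketch mirrors: progress checks run as plain code over symbolic state rather than as a VLM call, which is what makes the planning loop cheap, while the planner's object list is what lets the executor's observation be de-cluttered.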

Why it matters

VLA models struggle with long-horizon, non-Markovian tasks because task-relevant evidence may be occluded or appear only earlier in the trajectory, and clutter makes visual grounding brittle. By integrating a persistent semantic graph with code-based planning, CodeGraphVLP improves task completion over strong VLA baselines while substantially lowering planning latency compared to VLM-in-the-loop planning, making VLA models more practical for complex real-world manipulation.

Original Abstract

Vision-Language-Action (VLA) models promise generalist robot manipulation, but are typically trained and deployed as short-horizon policies that assume the latest observation is sufficient for action reasoning. This assumption breaks in non-Markovian long-horizon tasks, where task-relevant evidence can be occluded or appear only earlier in the trajectory, and where clutter and distractors make fine-grained visual grounding brittle. We present CodeGraphVLP, a hierarchical framework that enables reliable long-horizon manipulation by combining a persistent semantic-graph state with an executable code-based planner and progress-guided visual-language prompting. The semantic-graph maintains task-relevant entities and relations under partial observability. The synthesized planner executes over this semantic-graph to perform efficient progress checks and outputs a subtask instruction together with subtask-relevant objects. We use these outputs to construct clutter-suppressed observations that focus the VLA executor on critical evidence. On real-world non-Markovian tasks, CodeGraphVLP improves task completion over strong VLA baselines and history-enabled variants while substantially lowering planning latency compared to VLM-in-the-loop planning. We also conduct extensive ablation studies to confirm the contributions of each component.
