LLM-Flax : Generalizable Robotic Task Planning via Neuro-Symbolic Approaches with Large Language Models
TLDR
LLM-Flax is a neuro-symbolic framework that uses a locally hosted LLM to automate robotic task planning, eliminating manual rule authoring and GNN training data while outperforming manual baselines on MazeNamo.
Key contributions
- Automates rule generation for neuro-symbolic planners via structured LLM prompting and self-correction (see the sketch after this list).
- Implements LLM-guided failure recovery with a feasibility-gated budget policy for robust planning (sketched after the abstract).
- Replaces GNN object scorers with zero-shot LLM importance scoring, eliminating the need for training data (also sketched after the abstract).
- Outperforms manual baselines on MazeNamo, achieving 0.945 average success rate (+0.117).
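
Stage 1 is essentially a generate-validate-repair loop: prompt the LLM for rules, check the output format, and feed any violation back for self-correction. Below is a minimal Python sketch of that loop under assumptions of my own; the JSON schema, the `query_llm` callable, and the validator are hypothetical stand-ins, since this summary does not specify the paper's actual prompts or rule format.

```python
# Minimal sketch of a generate-validate-repair loop for Stage 1.
# All names (query_llm, RULE_SCHEMA) are hypothetical; the paper's actual
# prompts, schema, and validator are not specified in this summary.
import json

RULE_SCHEMA = {"relaxation_rules": list, "complementary_rules": list}  # assumed output format

def validate(raw: str):
    """Return parsed rules if the LLM output matches the expected schema, else an error message."""
    try:
        rules = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"Output is not valid JSON: {e}"
    for key, typ in RULE_SCHEMA.items():
        if not isinstance(rules.get(key), typ):
            return None, f"Field '{key}' missing or not a {typ.__name__}"
    return rules, None

def generate_rules(domain_pddl: str, query_llm, max_retries: int = 3):
    """Prompt the LLM for rules; on a format violation, feed the error back for self-correction."""
    prompt = f"Given this PDDL domain, emit relaxation and complementary rules as JSON:\n{domain_pddl}"
    for _ in range(max_retries):
        raw = query_llm(prompt)
        rules, error = validate(raw)
        if rules is not None:
            return rules
        # Self-correction: append the validator's complaint and ask the LLM to fix its output.
        prompt += f"\nYour previous output was invalid ({error}). Please correct it."
    raise RuntimeError("LLM failed to produce well-formed rules within the retry budget")
```

Feeding the validator's error message back into the prompt is the self-correction step: the LLM gets a concrete complaint to fix rather than a blind retry.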
Why it matters
This paper significantly reduces the manual effort required to deploy neuro-symbolic robotic task planners. By leveraging LLMs for rule generation, failure recovery, and object scoring, it makes these systems more generalizable and accessible.
Original Abstract
Deploying a neuro-symbolic task planner on a new domain today requires significant manual effort: a domain expert must author relaxation and complementary rules, and hundreds of training problems must be solved to supervise a Graph Neural Network (GNN) object scorer. We propose LLM-Flax, a three-stage framework that eliminates all three sources of manual effort using a locally hosted LLM given only a PDDL domain file. Stage 1 automatically generates relaxation and complementary rules via structured prompting with format validation and self-correction. Stage 2 introduces LLM-guided failure recovery with a feasibility-gated budget policy that explicitly reserves API latency cost before each LLM call, preventing the downstream relaxation fallback from being starved. Stage 3 replaces the domain-trained GNN entirely with zero-shot LLM object importance scoring, requiring no training data. We evaluate all three stages on the MazeNamo benchmark across 10x10, 12x12, and 15x15 grids (8 benchmarks total). LLM-Flax achieves average SR 0.945 versus the manual baseline's 0.828 (+0.117), matching or outperforming manual rules on every one of the eight benchmarks. On 12x12 Expert, LLM-Flax attains SR 0.733 where the manual planner fails entirely (SR 0.000); on 15x15 Hard, it achieves SR 1.000 versus Manual's 0.900. Stage 3 demonstrates feasibility (SR 0.720 on 12x12 Hard with no training data) but faces a context-window bottleneck at scale, pointing to the primary open challenge for future work.
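
The abstract describes Stage 2 as reserving the LLM call's latency cost up front so the relaxation fallback is never starved. A small sketch makes this concrete; everything here (class and field names, the fixed fallback reserve in seconds) is an illustrative assumption, not the paper's implementation.

```python
# Illustrative sketch of a feasibility-gated budget policy: only pay for an LLM
# recovery call if the relaxation fallback would still have its reserved budget.
from dataclasses import dataclass

@dataclass
class BudgetPolicy:
    remaining: float         # total planning-time budget left (seconds)
    fallback_reserve: float  # time that must stay available for the relaxation fallback

    def feasible(self, expected_llm_latency: float) -> bool:
        # Gate: allow the LLM call only if, after paying its expected latency,
        # the relaxation fallback keeps its reserved budget.
        return self.remaining - expected_llm_latency >= self.fallback_reserve

    def try_llm_recovery(self, expected_llm_latency: float, llm_call, fallback):
        if self.feasible(expected_llm_latency):
            self.remaining -= expected_llm_latency  # reserve the cost before calling
            return llm_call()
        return fallback()  # budget too tight: skip the LLM, go straight to relaxation
```

The gate is the key design point: feasibility is decided before anything is spent, so a tight budget degrades gracefully to the fallback instead of failing after a wasted LLM call.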
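Stage 3's zero-shot object importance scoring can likewise be sketched. Only the idea, an LLM scoring each object's goal relevance with no training data, comes from the abstract; the prompt wording, the 0-to-1 scale, and the `score_objects` and `query_llm` names are hypothetical.

```python
# Hypothetical sketch of zero-shot object importance scoring; the real prompt
# format and score scale are assumptions, not taken from the paper.
import json

def score_objects(problem_pddl: str, objects: list[str], query_llm) -> dict[str, float]:
    """Ask the LLM to rate each object's relevance to the goal, zero-shot."""
    prompt = (
        "Given this PDDL problem, score each object's importance to reaching the goal "
        "on a 0-1 scale. Reply with a JSON object mapping object name to score.\n"
        f"Problem:\n{problem_pddl}\nObjects: {', '.join(objects)}"
    )
    scores = json.loads(query_llm(prompt))
    # Keep only known objects and clamp scores into [0, 1] defensively.
    return {o: min(max(float(scores.get(o, 0.0)), 0.0), 1.0) for o in objects}
```

A sketch like this also makes the abstract's context-window bottleneck visible: the whole problem description and object list must fit in one prompt, which stops scaling on larger grids.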