CommandSwarm: Safety-Aware Natural Language-to-Behavior-Tree Generation for Robotic Swarms

May 8, 2026 · arXiv:2605.07764

cs.RO

TLDR

CommandSwarm enables safety-aware natural language control of robotic swarms by generating validated behavior trees using adapted LLMs.

Key contributions

Introduces CommandSwarm, a safety-aware pipeline for generating XML behavior trees from natural language.
Integrates multilingual translation, safety filtering, constrained prompting, LoRA-adapted LLMs, and parser validation.
Demonstrates that compact, 4-bit quantized LLMs (e.g., Falcon3-Instruct-10B) can generate useful swarm BTs.
LoRA adaptation of Falcon3-Instruct-10B improved zero-shot BLEU from 0.267 to 0.663 and parser-accepted syntactic validity from 0% to 72%.
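
To make the parser-validation step concrete, here is a minimal sketch of a deterministic whitelist check on a generated XML behavior tree. The node names (`Sequence`, `Selector`, `FormLine`, and so on) are hypothetical placeholders, since this summary does not list the paper's actual primitives, and the structural rules are assumptions rather than the authors' parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist: the paper's real primitive names are not given here.
CONTROL_NODES = {"BehaviorTree", "Sequence", "Selector"}
LEAF_PRIMITIVES = {"FormLine", "Disperse", "AvoidObstacle", "ReturnToBase"}


def validate_bt(xml_text):
    """Return (accepted, reason) for a candidate behavior tree in XML."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Malformed XML is rejected before any semantic check.
        return False, f"malformed XML: {exc}"

    if root.tag != "BehaviorTree":
        return False, f"unexpected root <{root.tag}>"

    # Walk every node; anything outside the whitelist is rejected.
    for node in root.iter():
        has_children = len(node) > 0
        if node.tag in CONTROL_NODES:
            if not has_children:
                return False, f"control node <{node.tag}> has no children"
        elif node.tag in LEAF_PRIMITIVES:
            if has_children:
                return False, f"leaf <{node.tag}> must not have children"
        else:
            return False, f"unsupported node <{node.tag}>"
    return True, "ok"
```

A tree such as `<BehaviorTree><Sequence><FormLine/><AvoidObstacle/></Sequence></BehaviorTree>` passes this check, while one containing an unlisted action is rejected with the offending tag named in the reason. Because the check is deterministic, the same output is always accepted or rejected the same way, regardless of how the LLM produced it.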

Why it matters

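This paper makes swarm robotics more accessible to non-experts by allowing natural language control. It addresses safety concerns by placing command-level safety filtering and deterministic parser validation around the generator, so that unsupported actions and malformed trees are rejected before execution. The findings show that compact, quantized, LoRA-adapted LLMs can generate useful swarm behavior trees, although at 72% parser acceptance the validation gates remain necessary.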

Original Abstract

Natural-language interfaces can make swarm robotics more accessible to non-expert operators, but they must translate ambiguous user intent into executable swarm behaviors without unsupported actions, malformed programs, or unsafe plans. This paper presents CommandSwarm, a safety-aware language-to-behavior-tree pipeline for generating XML behavior trees (BTs) from speech or text commands. The system combines multilingual translation, command-level safety filtering, constrained prompting, a LoRA-adapted large language model (LLM), and deterministic parser validation against a whitelist of executable swarm primitives. We evaluate eleven open 6.7B–14B parameter LLMs, all using 4-bit quantization, on representative swarm-control scenarios under zero-shot, one-shot, and two-shot prompting. Falcon3-Instruct-10B and Mistral-7B-v3 are the strongest prompt-engineered candidates, reaching BLEU scores above 0.60 and high syntactic validity in few-shot settings. LoRA adaptation of Falcon3-Instruct-10B on a 2,063-example synthetic instruction–BT corpus improves zero-shot BLEU from 0.267 to 0.663, ROUGE-L from 0.366 to 0.692, and parser-accepted syntactic validity from 0% to 72%. Translation experiments further show that SeamlessM4T v2-large and EuroLLM-9B provide the best quality-latency trade-offs for the multilingual front end. The results indicate that compact, quantized, domain-adapted LLMs can generate useful swarm BTs when embedded in a validated systems pipeline. They also show that parser acceptance and safety filtering remain necessary execution gates; generation quality alone is not sufficient for autonomous deployment.
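
The abstract's closing point, that safety filtering and parser acceptance act as execution gates around the generator, can be sketched as a small gated pipeline. Everything named here is an assumption for illustration: the word blocklist stands in for the paper's command-level safety filter (whose design is not described in this summary), `generate` is a placeholder for the LoRA-adapted model, and the default validator only checks XML well-formedness.

```python
import re
import xml.etree.ElementTree as ET
from typing import Callable, Optional

# Hypothetical blocklist standing in for the paper's safety filter.
BLOCKED_WORDS = {"attack", "collide", "crash"}


def command_is_safe(command: str) -> bool:
    """Gate 1: reject commands containing any blocked word."""
    words = set(re.findall(r"[a-z]+", command.lower()))
    return not (words & BLOCKED_WORDS)


def is_well_formed(bt_xml: str) -> bool:
    """Gate 2 (simplified): accept only XML that parses."""
    try:
        ET.fromstring(bt_xml)
    except ET.ParseError:
        return False
    return True


def command_to_bt(
    command: str,
    generate: Callable[[str], str],
    validate: Callable[[str], bool] = is_well_formed,
) -> Optional[str]:
    """Return a behavior tree only if both gates pass, else None."""
    if not command_is_safe(command):
        return None  # blocked before the model is ever called
    candidate = generate(command)
    return candidate if validate(candidate) else None
```

The design choice worth noting is that a rejected command never reaches the generator, and a generated tree never reaches the robots unless the validator accepts it, so generation quality and execution safety are checked separately.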
