ArXiv TLDR

A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory

arXiv:2605.02525

Bogdan Felician Abaza, Andrei-Alexandru Staicu, Cristian Vasile Doicin

cs.RO cs.AI

TLDR

This paper introduces a semantic autonomy framework for indoor mobile robots that combines hybrid deterministic-VLM reasoning with cross-robot adaptive memory to interpret natural language instructions.

Key contributions

  • Presents a six-layer Semantic Autonomy Stack for VLM-integrated indoor robot navigation.
  • Employs hybrid deterministic-VLM reasoning: a parametric resolver handles 88% of instructions in under 0.1 ms, without invoking a VLM, camera, or GPU.
  • Features cross-robot adaptive memory for learning preferences and transferring knowledge across sessions/robots.
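The deterministic-first control flow behind the second contribution can be sketched as follows. This is a minimal illustrative assumption, not the authors' seven-step resolver: the target table, matching rule, and fallback signature are all hypothetical.

```python
# Hypothetical sketch of hybrid deterministic-VLM resolution:
# a fast lookup handles known semantic targets; only ambiguous
# instructions escalate to the slow VLM path.

KNOWN_TARGETS = {
    "charging dock": (0.5, 0.2),  # assumed metric map coordinates
    "kitchen": (3.1, 4.7),
}

def deterministic_resolve(instruction: str):
    """Fast path: match the instruction against known semantic targets.
    Returns metric coordinates, or None if the instruction is ambiguous."""
    text = instruction.lower()
    for name, pose in KNOWN_TARGETS.items():
        if name in text:
            return pose
    return None  # ambiguous -> escalate

def resolve(instruction: str, vlm_fallback):
    """Try the deterministic resolver first; escalate to the VLM
    (seconds of latency) only when no deterministic match exists."""
    pose = deterministic_resolve(instruction)
    if pose is not None:
        return pose, "deterministic"
    return vlm_fallback(instruction), "vlm"

# Usage: the VLM fallback is stubbed out here.
pose, path = resolve("go to the kitchen", vlm_fallback=lambda s: (0.0, 0.0))
```

In this structure, the 88% figure from the paper corresponds to instructions answered by the fast path alone.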

Why it matters

This framework works around two key VLM limitations, inference latency (2-9 seconds per decision on consumer hardware) and session-by-session amnesia, enabling indoor robots to interpret natural language instructions efficiently. It achieves 100% semantic resolution accuracy and cross-robot knowledge transfer on low-cost edge hardware (Raspberry Pi 5, no onboard GPU), making practical deployment feasible.

Original Abstract

Autonomous indoor mobile robots can navigate reliably to metric coordinates using established frameworks such as ROS 2 Navigation 2, yet they lack the ability to interpret natural language instructions that express intent rather than positions. Vision-Language Models offer the semantic reasoning required to bridge this gap, but their inference latency (2-9 seconds per decision on consumer hardware) and session-by-session amnesia limit practical deployment. This paper presents the Semantic Autonomy Stack, a six-layer reference framework for semantically autonomous indoor navigation, and validates a complete instance featuring hybrid deterministic-VLM reasoning and cross-robot adaptive memory on physical robots with off-the-shelf edge hardware. A seven-step parametric resolver handles 88% of instructions in under 0.1 milliseconds without invoking a language model, camera, or GPU; only genuinely ambiguous instructions escalate to VLM reasoning. A five-category semantic memory framework with explicit scope taxonomy (global environment knowledge, per-operator preferences, per-robot capabilities) enables cross-session learning and cross-robot knowledge transfer: preferences learned through VLM interactions on one robot are promoted to deterministic resolution and transferred to a second robot via a shared compiled digest, achieving a measured latency reduction of 103,000-fold. Experimental validation on two custom-built differential-drive robots across 82 scenario-level decisions and three sessions demonstrates 100% semantic transfer accuracy (33/33, 95% CI [0.894, 1.000]), 100% semantic resolution accuracy, and concurrent multi-robot operation feasibility - all on Raspberry Pi 5 platforms with no onboard GPU, requiring zero training data.
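The promote-and-transfer mechanism from the abstract (VLM-learned preferences promoted to deterministic resolution and shared via a compiled digest) could look roughly like this. All class, method, and threshold choices here are illustrative assumptions, not the paper's five-category memory framework.

```python
# Hypothetical sketch of cross-robot adaptive memory: a preference
# learned through VLM interactions on robot A is promoted to a
# deterministic rule, compiled into a digest, and loaded by robot B.
import json

class SemanticMemory:
    def __init__(self):
        self.learned = {}        # resolutions observed via VLM interactions
        self.deterministic = {}  # promoted rules, resolved with no VLM call

    def record_vlm_resolution(self, phrase, pose):
        """Log a VLM-produced resolution; promote it to the deterministic
        table once the same answer recurs (threshold of 2 is assumed)."""
        self.learned.setdefault(phrase, []).append(pose)
        if self.learned[phrase].count(pose) >= 2:
            self.deterministic[phrase] = pose

    def compile_digest(self) -> str:
        """Serialize the promoted rules into a shareable digest."""
        return json.dumps(self.deterministic)

    def load_digest(self, digest: str):
        """Another robot imports the promoted rules, skipping VLM
        reasoning for those phrases entirely."""
        self.deterministic.update(json.loads(digest))

# Robot A learns a preference through repeated VLM interactions...
a = SemanticMemory()
a.record_vlm_resolution("my desk", (1.0, 2.0))
a.record_vlm_resolution("my desk", (1.0, 2.0))

# ...and robot B receives it via the compiled digest.
b = SemanticMemory()
b.load_digest(a.compile_digest())
```

Once a phrase lives in the deterministic table, resolving it is a dictionary lookup rather than a multi-second VLM call, which is the source of the large latency reduction the paper reports.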
