Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling
Anh H. Vo, Sungyo Lee, Phil-Joong Kim, Soo-Mi Choi, Yong-Guk Kim
TLDR
This paper unifies language-driven 3D scene generation with immersive user interaction using LLMs and RL, enabling adaptive VR experiences.
Key contributions
- Presents a unified framework for language-driven 3D scene generation and immersive user interaction.
- Uses LLMs to construct scene representations and RL to optimize spatial layouts with constraints.
- Deploys generated environments in VR for HRI-in-the-loop, using user feedback to refine content.
- Achieves state-of-the-art performance on the ALFRED benchmark, with user studies showing gains in immersion, interaction quality, and task efficiency.
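The closed loop described above can be sketched as a minimal toy pipeline. This is purely illustrative: all names (`llm_parse`, `rl_optimize`, `closed_loop`) are hypothetical stand-ins, not the paper's actual components, and the "LLM" and "RL" stages are replaced by trivial placeholder logic.

```python
# Illustrative sketch of a closed generation-interaction loop.
# Every function here is a hypothetical placeholder, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class Scene:
    objects: list
    layout: dict = field(default_factory=dict)

def llm_parse(instruction: str) -> Scene:
    # Stand-in for the LLM stage: turn a language instruction
    # into a structured scene representation.
    return Scene(objects=instruction.lower().split())

def rl_optimize(scene: Scene, feedback: float) -> Scene:
    # Stand-in for the RL stage: recompute the spatial layout,
    # nudged by the latest user-feedback score.
    scene.layout = {obj: i * (1.0 + feedback)
                    for i, obj in enumerate(scene.objects)}
    return scene

def closed_loop(instruction: str, feedback_signals: list) -> Scene:
    # Generate once, then refine each round using VR user feedback.
    scene = llm_parse(instruction)
    for fb in feedback_signals:
        scene = rl_optimize(scene, fb)
    return scene

scene = closed_loop("place a sofa near the window", [0.0, 0.5])
print(scene.layout)
```

The point of the sketch is the control flow: generation is not a one-shot step, but is re-entered with each round of interaction feedback, which is the "closing the loop" the paper argues for.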
Why it matters
This paper addresses a key limitation in 3D content generation by integrating user interaction into the generation loop. It significantly enhances the adaptability and realism of virtual environments. This approach paves the way for more immersive and responsive next-generation multimedia systems.
Original Abstract
Recent advances in large language models (LLMs) have significantly improved language-driven 3D content generation, but most existing approaches still treat scene generation and user interaction as separate processes, limiting the adaptability and immersive potential of interactive multimedia systems. This paper presents a unified framework that closes the loop between language-driven 3D scene generation and immersive user interaction. Given natural language instructions, the system first constructs structured scene representations using LLMs, and then optimizes spatial layouts via reinforcement learning under geometric and semantic constraints. The generated environments are deployed in a virtual reality setting to facilitate HRI-in-the-loop, where user interactions provide continuous feedback to align generated content with human perception and usability. By tightly coupling generation and interaction, the proposed framework enables more responsive, adaptive, and realistic multimedia experiences. Experiments on the ALFRED benchmark demonstrate state-of-the-art performance in task-based scene generation. Furthermore, qualitative results and user studies show consistent improvements in immersion, interaction quality, and task efficiency, highlighting the importance of closed-loop integration of generation and interaction for next-generation multimedia systems. Our project page can be found at https://proj-showcase.github.io/h3ds/.