ArXiv TLDR

Scaling Instructable Agents Across Many Simulated Worlds

arXiv: 2404.10179

SIMA Team, Maria Abi Raad, Arun Ahuja, Catarina Barros, Frederic Besse + 89 more

cs.RO, cs.AI, cs.HC, cs.LG

TLDR

This paper presents SIMA, an instructable embodied AI agent trained to follow free-form language instructions across diverse 3D simulated environments using a generic keyboard-and-mouse interface.

Key contributions

  • Introduces SIMA, an agent capable of grounding language in perception and embodied actions across multiple virtual 3D worlds.
  • Employs a general, human-like interface (image observations in, keyboard-and-mouse actions out) that enables real-time interaction in varied environments; a minimal sketch of this interface follows the list.
  • Demonstrates preliminary success in both curated research settings and complex commercial video games, highlighting broad applicability.
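
To make the interface concrete, here is a minimal Python sketch of the observation/action loop the bullets above describe. Every class, field, and environment method name here (Observation, Action, InstructableAgent, env.reset, env.step) is a hypothetical illustration for this summary, not the SIMA team's actual API.

```python
# Hypothetical sketch of SIMA's human-like agent interface as described in the
# paper: image observations plus a language instruction in, keyboard-and-mouse
# actions out. All names are illustrative, not the authors' actual code.
from dataclasses import dataclass
from typing import Tuple

import numpy as np


@dataclass
class Observation:
    """What the agent sees each timestep: pixels only, no privileged game state."""
    image: np.ndarray        # e.g. an (H, W, 3) RGB frame from the environment
    instruction: str         # free-form language, e.g. "chop down the tree"


@dataclass
class Action:
    """Generic keyboard-and-mouse output, reusable across any 3D environment."""
    keys: Tuple[str, ...]    # keys held this step, e.g. ("w",) to move forward
    mouse_dx: float          # relative horizontal mouse motion (camera turn)
    mouse_dy: float          # relative vertical mouse motion (camera pitch)
    left_click: bool = False
    right_click: bool = False


class InstructableAgent:
    """Placeholder policy mapping (image, instruction) pairs to actions."""

    def act(self, obs: Observation) -> Action:
        # A real agent would run a learned vision-language policy here;
        # this stub just idles, to show the shape of the interface.
        return Action(keys=(), mouse_dx=0.0, mouse_dy=0.0)


def run_episode(env, agent: InstructableAgent, instruction: str,
                max_steps: int = 1000) -> None:
    """Real-time loop: the environment is never queried for internal state."""
    frame = env.reset()                         # assumed env API: reset() -> RGB frame
    for _ in range(max_steps):
        obs = Observation(image=frame, instruction=instruction)
        frame, done = env.step(agent.act(obs))  # assumed env API: step(action)
        if done:
            break
```

The design point the sketch illustrates is that the same observation and action types work unchanged in a curated research environment or a commercial video game, which is what lets the agent be dropped into new worlds without per-environment engineering.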

Why it matters

Creating AI agents that can understand and execute arbitrary language instructions in any 3D environment is crucial for advancing general AI. By focusing on language-driven generality and minimal assumptions, SIMA pushes the boundaries of embodied AI, enabling agents to operate flexibly across a wide range of visually and semantically diverse worlds. This work lays the groundwork for AI systems that can adapt to new, complex environments much as humans do.

Original Abstract

Building embodied AI systems that can follow arbitrary language instructions in any 3D environment is a key challenge for creating general AI. Accomplishing this goal requires learning to ground language in perception and embodied actions, in order to accomplish complex tasks. The Scalable, Instructable, Multiworld Agent (SIMA) project tackles this by training agents to follow free-form instructions across a diverse range of virtual 3D environments, including curated research environments as well as open-ended, commercial video games. Our goal is to develop an instructable agent that can accomplish anything a human can do in any simulated 3D environment. Our approach focuses on language-driven generality while imposing minimal assumptions. Our agents interact with environments in real-time using a generic, human-like interface: the inputs are image observations and language instructions and the outputs are keyboard-and-mouse actions. This general approach is challenging, but it allows agents to ground language across many visually complex and semantically rich environments while also allowing us to readily run agents in new environments. In this paper we describe our motivation and goal, the initial progress we have made, and promising preliminary results on several diverse research environments and a variety of commercial video games.
