ArXiv TLDR

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

arXiv: 2604.02289

Chongjie Ye, Cheng Cao, Chuanyu Pan, Yiming Hao, Yihao Zhi + 2 more

cs.CV, cs.AI

TLDR

Omni123 is a 3D-native foundation model that unifies text-to-2D and text-to-3D generation, leveraging abundant 2D data to improve 3D representations despite the scarcity of 3D data.

Key contributions

  • Unifies text-to-2D and text-to-3D generation within a single autoregressive 3D-native foundation model.
  • Leverages cross-modal consistency between 2D images and 3D geometry as an implicit structural constraint.
  • Introduces an interleaved X-to-X training paradigm over heterogeneous paired datasets, without requiring fully aligned text-image-3D triplets (sketched after this list).
  • Significantly improves text-guided 3D generation and editing by enforcing multi-view geometric consistency.
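
To make the interleaved X-to-X idea concrete, here is a minimal, hypothetical sketch of how heterogeneous pairs (text-image, image-3D) could be flattened into one shared discrete-token sequence for a single autoregressive model. The tokenizer, special tokens, and file names are illustrative stand-ins, not the paper's actual interface.

```python
# Hypothetical sketch of interleaved X-to-X data construction:
# each heterogeneous pair becomes one flat token sequence in a shared vocabulary,
# so text-image and image-3D pairs can train the same autoregressive model.
from typing import List, Tuple

# Illustrative modality-start markers in a shared vocabulary (not the paper's values).
BOS = {"text": 50_000, "image": 50_001, "3d": 50_002}

def encode(modality: str, payload: str) -> List[int]:
    """Stand-in tokenizer; a real system would use BPE for text and VQ codebooks
    for images and 3D assets."""
    return [hash((modality, ch)) % 50_000 for ch in payload]

def build_sequence(pair: List[Tuple[str, str]]) -> List[int]:
    """Concatenate source then target tokens; the order defines the X-to-X direction."""
    seq: List[int] = []
    for modality, payload in pair:
        seq += [BOS[modality]] + encode(modality, payload)
    return seq

# A text->image pair and an image->3D pair feed the same model;
# no fully aligned text-image-3D triplet is needed.
batch = [
    build_sequence([("text", "a red chair"), ("image", "chair.png")]),
    build_sequence([("image", "chair.png"), ("3d", "chair.glb")]),
]
```

The point of the shared sequence space is that each pair only needs the modalities it actually has, while the common vocabulary lets abundant 2D data regularize the 3D tokens.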

Why it matters

This paper presents a novel approach to overcoming the scarcity of high-quality 3D data by unifying 2D and 3D generation in a single model. It offers a scalable path toward multimodal 3D world models and significantly improves 3D content creation from text.

Original Abstract

Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.
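
The semantic-visual-geometric cycle described in the abstract (text to image to 3D to image) can be pictured as one long autoregressive rollout. The toy sampler below is only a placeholder for the actual transformer, whose architecture and API are not given in this digest; it shows how each stage conditions on the previous ones and where the multi-view consistency signal would come from.

```python
# Self-contained toy rollout of a text -> image -> 3D -> image cycle.
# The "model" here is a deterministic placeholder, not Omni123 itself.
import random
from typing import List

VOCAB = 50_003  # illustrative shared text/image/3D vocabulary size

def fake_next_token(prefix: List[int]) -> int:
    """Placeholder next-token sampler standing in for the autoregressive transformer."""
    random.seed(sum(prefix) if prefix else 0)
    return random.randrange(VOCAB)

def generate(prefix: List[int], n_tokens: int) -> List[int]:
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(fake_next_token(out))
    return out[len(prefix):]

text_tokens = [101, 202, 303]                              # stand-in prompt tokens
image_tokens = generate(text_tokens, n_tokens=16)          # text -> image (appearance)
shape_tokens = generate(text_tokens + image_tokens, 32)    # image -> 3D (geometry)
reimaged = generate(shape_tokens, n_tokens=16)             # 3D -> image (re-render)

# In training, the re-generated image tokens would be pushed to agree with the
# first image stage; per the abstract, this cycle is what jointly enforces semantic
# alignment, appearance fidelity, and multi-view geometric consistency.
```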

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.