Context Unrolling in Omni Models
Ceyuan Yang, Zhijie Lin, Yang Zhao, Fei Xiao, Hao He + 14 more
TLDR
Omni is a unified multimodal model that uses 'Context Unrolling' to reason across diverse data types, improving both multimodal understanding and generation.
Key contributions
- Omni: A unified multimodal model trained on text, images, videos, 3D geometry, and hidden representations.
- Introduces 'Context Unrolling' for explicit reasoning across diverse modal representations.
- Aggregates complementary information across heterogeneous modalities, yielding a more faithful approximation of shared multimodal knowledge and improved downstream reasoning fidelity.
Why it matters
This paper introduces a novel approach to multimodal AI in which a single model explicitly reasons across diverse data types before answering. The 'Context Unrolling' mechanism allows information from heterogeneous modalities to be integrated more faithfully, pushing the boundaries of unified multimodal understanding and generation.
Original Abstract
We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.
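The abstract describes Context Unrolling only at a high level. As a rough illustration of one possible reading (ours, not the paper's implementation), the sketch below treats it as a multimodal chain of thought: the model autoregressively emits intermediate tokens, which may belong to any modality, until a hypothetical end-of-reasoning marker, and only then commits to the final prediction. The function name, `model.step`, the special token ids, and the stub model are all illustrative placeholders rather than the authors' API.

```python
# Minimal sketch of a "Context Unrolling"-style inference loop, assuming it
# resembles a multimodal chain of thought. All names and token ids here are
# hypothetical placeholders, not taken from the paper.

def generate_with_context_unrolling(model, prompt_tokens, max_steps=1024,
                                    end_of_reasoning_id=1, end_of_answer_id=2):
    """Autoregressively unroll intermediate multimodal context, then answer."""
    sequence = list(prompt_tokens)
    unrolled, answer = [], []
    phase = "unroll"

    for _ in range(max_steps):
        next_token = model.step(sequence)    # hypothetical next-token prediction
        sequence.append(next_token)

        if phase == "unroll":
            if next_token == end_of_reasoning_id:
                phase = "answer"             # switch from reasoning to answering
            else:
                unrolled.append(next_token)  # intermediate text/image/video/3D tokens
        else:
            if next_token == end_of_answer_id:
                break
            answer.append(next_token)

    return unrolled, answer


if __name__ == "__main__":
    class _StubModel:
        """Dummy model: emits three 'reasoning' tokens, a marker, then two answer tokens."""
        _script = [10, 11, 12, 1, 20, 21, 2]

        def __init__(self):
            self._i = 0

        def step(self, sequence):
            token = self._script[self._i]
            self._i += 1
            return token

    reasoning, final = generate_with_context_unrolling(_StubModel(), [0])
    print("unrolled context:", reasoning, "answer:", final)
```

Under this reading, the key design choice is that the intermediate tokens attend to every ingested modality, so the final answer is conditioned on an explicitly aggregated context rather than on each modality in isolation.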