SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai + 53 more
TLDR
SenseNova-U1 introduces NEO-unify, a unified architecture that integrates multimodal understanding and generation in a single process, rivaling top-tier understanding-only VLMs while adding strong generative capabilities.
Key contributions
- Introduces SenseNova-U1 with NEO-unify, a unified architecture for multimodal understanding and generation.
- Two variants (SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT), built on dense 8B and mixture-of-experts 30B-A3B baselines, rival top understanding-only VLMs across diverse tasks.
- Delivers strong any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation.
- Shows strong performance in vision-language-action (VLA) and world model scenarios.
Why it matters
This paper addresses a fundamental limitation in current VLMs by unifying understanding and generation into a single process. By rivaling top understanding-only VLMs while delivering strong generation across a wide range of tasks, SenseNova-U1 paves the way for truly native multimodal AI. Its preliminary results in VLA and world-model scenarios suggest a significant step toward more integrated and intelligent systems.
Original Abstract
Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.
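The digest does not reproduce NEO-unify's internals, but the core idea in the abstract (understanding and generation as synergistic views of one autoregressive process over a shared token stream) can be illustrated with a toy sketch. The code below is a hypothetical illustration under assumed design choices, not the paper's actual architecture: the class name, vocabulary sizes, and the use of modality-specific embeddings and output heads are all assumptions.

```python
import torch
import torch.nn as nn

class UnifiedDecoderSketch(nn.Module):
    """Toy unified decoder: text and image tokens share one causal
    transformer, with modality-specific embeddings and output heads.
    Illustrates the generic 'single process' idea only; it does not
    reproduce the NEO-unify design described in the paper."""

    def __init__(self, text_vocab=32000, image_vocab=8192, d_model=512,
                 n_layers=4, n_heads=8, max_len=1024):
        super().__init__()
        # Modality-specific token embeddings (an assumption, not the paper's choice).
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.image_emb = nn.Embedding(image_vocab, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Separate output heads: next text token vs. next image token.
        self.text_head = nn.Linear(d_model, text_vocab)
        self.image_head = nn.Linear(d_model, image_vocab)

    def forward(self, tokens, modality):
        # tokens: (B, T) token ids; modality: (B, T) with 0 = text, 1 = image.
        B, T = tokens.shape
        is_img = modality.bool()
        # Clamp ids so the illustrative forward pass is safe when vocab sizes differ.
        text_x = self.text_emb(tokens.clamp(max=self.text_emb.num_embeddings - 1))
        img_x = self.image_emb(tokens.clamp(max=self.image_emb.num_embeddings - 1))
        x = torch.where(is_img.unsqueeze(-1), img_x, text_x)
        x = x + self.pos_emb(torch.arange(T, device=tokens.device))
        # One causal pass serves both understanding and generation views.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        h = self.backbone(x, mask=mask)
        return self.text_head(h), self.image_head(h)


# Minimal usage: one interleaved sequence, one forward pass.
model = UnifiedDecoderSketch()
tokens = torch.randint(0, 8192, (1, 16))
modality = torch.tensor([[0] * 8 + [1] * 8])  # 8 text tokens, then 8 image tokens
text_logits, image_logits = model(tokens, modality)
print(text_logits.shape, image_logits.shape)
```

In such a setup, interleaved vision-language generation and X2I synthesis would simply be next-token prediction over the shared sequence, with the sampled head chosen by the position's modality; how SenseNova-U1 actually routes modalities (e.g., its MoT design) is detailed in the paper itself.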