ArXiv TLDR

Diffusion Model as a Generalist Segmentation Learner

2604.24575

Haoxiao Wang, Antao Xiang, Haiyang Sun, Peilin Sun, Changhao Pan + 6 more

cs.CV

TLDR

DiGSeg repurposes diffusion models for versatile, text-conditioned segmentation across diverse domains without custom architectures.

Key contributions

  • Introduces DiGSeg, a unified segmentation framework using pretrained diffusion models.
  • Combines image, mask, and CLIP-aligned text features for multi-scale segmentation conditioning.
  • Achieves state-of-the-art semantic segmentation and open-vocabulary generalization.
  • Demonstrates strong cross-domain transfer to medical, remote sensing, and agricultural tasks.

Why it matters

This paper shows diffusion models can unify segmentation tasks with text conditioning, bridging generation and understanding. It enables flexible, domain-agnostic segmentation without redesigning architectures.

Original Abstract

Diffusion models are primarily trained for image synthesis, yet their denoising trajectories encode rich, spatially aligned visual priors. In this paper, we demonstrate that these priors can be utilized for text-conditioned semantic and open-vocabulary segmentation, and this approach can be generalized to various downstream tasks to make a general-purpose diffusion segmentation framework. Concretely, we introduce DiGSeg (Diffusion Models as a Generalist Segmentation Learner), which repurposes a pretrained diffusion model into a unified segmentation framework. Our approach encodes the input image and ground-truth mask into the latent space and concatenates them as conditioning signals for the diffusion U-Net. A parallel CLIP-aligned text pathway injects language features across multiple scales, enabling the model to align textual queries with evolving visual representations. This design transforms an off-the-shelf diffusion backbone into a universal interface that produces structured segmentation masks conditioned on both appearance and arbitrary text prompts. Extensive experiments demonstrate state-of-the-art performance on standard semantic segmentation benchmarks, as well as strong open-vocabulary generalization and cross-domain transfer to medical, remote sensing, and agricultural scenarios, without domain-specific architectural customization. These results indicate that modern diffusion backbones can serve as generalist segmentation learners rather than pure generators, narrowing the gap between visual generation and visual understanding.
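The conditioning scheme the abstract describes (image and mask latents concatenated channel-wise, with CLIP-aligned text features injected at multiple U-Net scales) can be sketched in terms of tensor shapes. This is a minimal, hypothetical shape-flow illustration, not the authors' implementation: the encoder stub, the latent dimensions (4-channel latents at 1/8 resolution, as in common Stable Diffusion VAEs), and the 768-dim text embedding are assumptions for the sake of the example.

```python
import numpy as np

def encode_to_latent(x, channels=4, down=8):
    """Stand-in for a VAE encoder: maps (C, H, W) -> (channels, H/down, W/down).
    Hypothetical; the real encoder is a learned network."""
    _, h, w = x.shape
    return np.zeros((channels, h // down, w // down), dtype=np.float32)

image = np.zeros((3, 512, 512), dtype=np.float32)  # RGB input
mask = np.zeros((1, 512, 512), dtype=np.float32)   # ground-truth mask

z_img = encode_to_latent(image)   # (4, 64, 64)
z_mask = encode_to_latent(mask)   # (4, 64, 64)

# Concatenate image and mask latents channel-wise as the U-Net conditioning signal.
cond = np.concatenate([z_img, z_mask], axis=0)  # (8, 64, 64)

# CLIP-aligned text features injected at multiple scales; here the pooled text
# vector is simply broadcast to each U-Net feature resolution to show the idea.
text = np.zeros((768,), dtype=np.float32)  # assumed CLIP text embedding width
scales = [64, 32, 16, 8]
text_maps = {s: np.broadcast_to(text[:, None, None], (768, s, s)) for s in scales}

print(cond.shape)           # (8, 64, 64)
print(text_maps[16].shape)  # (768, 16, 16)
```

In the actual model the text pathway would attend to the U-Net's intermediate features (e.g. via cross-attention) rather than being broadcast, but the shape bookkeeping above captures the dual image/mask and text conditioning described in the abstract.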
