HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models
Arani Roy, Shristi Das Biswas, Kaushik Roy
TLDR
HEART uses hyperspherical embeddings and Kent distributions to enable precise, training-free control over text-to-image diffusion models, preserving scene details.
Key contributions
- Introduces HEART, a training-free framework for diffusion model control.
- Shows that text encoder representations lie on a hypersphere, where concepts are anisotropic distributions better captured by Kent distributions than by linear directions.
- Enables precise subject replacement and fine-grained attribute control without scene alteration.
- Generalizes across various diffusion model architectures without finetuning or inversion.
Why it matters
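Existing text-based control methods often produce unintended side effects, such as altered backgrounds or distorted details, because they treat the embedding space as Euclidean and apply simple linear transformations. The paper finds that text encoder representations instead lie on a hypersphere, where concepts form structured, anisotropic distributions rather than linear directions. By editing along geodesics on that sphere, HEART makes precise edits while preserving the original scene, and it does so without finetuning, inversion, or optimization.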
Original Abstract
Text-to-image diffusion models can generate visually stunning images, yet, controlling what appears and how it appears, remains surprisingly difficult, especially when operating solely within the constraints of the text-conditioning space. For example, changing a subject or adjusting an attribute often leads to unintended side effects, such as altered backgrounds or distorted details. This is because most existing text-based control methods treat the embedding space as Euclidean and apply simple linear transformations, which do not reflect how semantic concepts are actually organized. In this work, we take a step back and ask: what is the true geometry of these embeddings? We find that text encoder representations lie on a hypersphere, where concepts are not linear directions but structured, anisotropic distributions better captured by Kent distributions. Building on this insight, we propose HEART, a training-free framework that performs Kent-aware geodesic transformations directly on the hypersphere. By respecting the underlying geometry, HEART enables intuitive and precise edits, such as consistent subject replacement and fine-grained attribute control, while preserving the original scene. Importantly, HEART requires no finetuning, inversion, or optimization, and generalizes across diffusion model architectures. Our results show that a simple shift in perspective, from linear to spherical, can unlock fast, and controllable image generation.
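The contrast the abstract draws between linear and spherical edits can be illustrated with a small sketch. This is not the authors' HEART implementation: it omits the Kent-distribution modeling entirely and shows only plain geodesic (spherical linear) interpolation between two unit vectors, compared with straight-line interpolation. The function names `normalize`, `slerp`, and `lerp`, the 768-dimensional size, and the random stand-in embeddings are all assumptions chosen for illustration.

```python
import numpy as np


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    return v / np.linalg.norm(v)


def slerp(a, b, t):
    """Move a fraction t along the great-circle arc from a to b.

    Hypothetical helper for illustration. Assumes a and b are not
    antipodal, since the geodesic is then not unique.
    """
    a, b = normalize(a), normalize(b)
    cos_omega = np.clip(np.dot(a, b), -1.0, 1.0)
    omega = np.arccos(cos_omega)  # angle between the two points
    if omega < 1e-8:  # points coincide; nothing to traverse
        return a
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)


def lerp(a, b, t):
    """Straight-line (Euclidean) interpolation, shown for comparison."""
    return (1 - t) * a + t * b


# Random stand-ins for a source and a target concept embedding.
rng = np.random.default_rng(0)
src = normalize(rng.standard_normal(768))
tgt = normalize(rng.standard_normal(768))

mid_geodesic = slerp(src, tgt, 0.5)
mid_linear = lerp(src, tgt, 0.5)

# The geodesic midpoint stays on the sphere; the linear midpoint falls inside it.
print("geodesic midpoint norm:", np.linalg.norm(mid_geodesic))
print("linear midpoint norm:  ", np.linalg.norm(mid_linear))
```

The geodesic midpoint keeps unit norm up to floating-point error, while the linear midpoint has norm below 1. That drift off the sphere is one simple way to see why treating the space as Euclidean can move an embedding away from where the text encoder's representations live.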