ArXiv TLDR

ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

2604.09450

Lifeng Chen, Tianqi You, Hao Liu, Zhimin Bao, Jile Jiao + 6 more

cs.LG · cs.AI · eess.IV

TLDR

ECHO is a new diffusion-based VLM for chest X-ray report generation that achieves 8x faster inference with one-step block diffusion.

Key contributions

  • Proposes ECHO, an efficient diffusion VLM for chest X-ray report generation.
  • Introduces Direct Conditional Distillation (DCD) for stable one-step-per-block inference, mitigating mean-field bias.
  • Employs Response-Asymmetric Diffusion (RAD) for improved training efficiency and effectiveness.

Why it matters

This paper introduces ECHO, a novel diffusion VLM that significantly speeds up chest X-ray report generation. By enabling one-step inference, it addresses the high latency of current methods, potentially alleviating radiologists' workload. Its 8x speedup without accuracy loss makes it a practical advancement for clinical settings.

Original Abstract

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision–language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose **ECHO**, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by **64.33%** and **60.58%** respectively, while achieving an **8×** inference speedup without compromising clinical accuracy.
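To make the latency argument concrete, here is a minimal toy sketch of one-step-per-block decoding. This is a hypothetical illustration, not the authors' ECHO implementation: the `one_step_denoiser` stands in for a distilled network, and the vocabulary and block sizes are invented for the example. The point it shows is structural: each block of masked tokens is filled with a single denoiser call (conditioned on previously committed blocks), rather than many refinement iterations.

```python
import random

MASK = "<mask>"
# Invented toy vocabulary for illustration only.
VOCAB = ["no", "acute", "findings", "opacity", "effusion", "cardiomegaly"]

def one_step_denoiser(context, block):
    # Stand-in for a distilled denoiser: fills every masked position
    # in the block jointly, in ONE forward pass (no iterative refinement).
    return [random.choice(VOCAB) if tok == MASK else tok for tok in block]

def generate(num_blocks=3, block_size=4, seed=0):
    random.seed(seed)
    report = []
    for _ in range(num_blocks):
        block = [MASK] * block_size               # start from a fully masked block
        block = one_step_denoiser(report, block)  # single denoising step per block
        report.extend(block)                      # commit; next block conditions on it
    return report

tokens = generate()
print(tokens)
```

With `B` blocks, decoding costs `B` denoiser calls instead of `B × T` iterative steps, which is the source of the speedup the abstract claims (the actual factor depends on the model and block schedule).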
