ArXiv TLDR

Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

arXiv: 2604.19635

Shuhai Peng, Hui Lu, Jinjiang Liu, Liyang Chen, Guiping Zhong + 6 more

cs.SD · cs.AI

TLDR

Introduces the first autoregressive models for streaming Target Speaker Extraction, achieving stable, efficient real-time performance comparable to offline methods.

Key contributions

  • Presents the first autoregressive (AR) models specifically designed for streaming Target Speaker Extraction (TSE).
  • Introduces a Chunk-wise Interleaved Splicing Paradigm for highly efficient and stable streaming inference (see the sketch after this list).
  • Uses a historical context refinement mechanism to ensure coherence and mitigate boundary discontinuities.
  • Streams with 100% stability and superior intelligibility at an RTF of 0.248 on consumer-level GPUs, matching or surpassing offline baselines.
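
The bullets above only name the mechanisms; as a rough illustration, a chunk-wise streaming loop that conditions each decoding step on recently extracted audio might look like the sketch below. This is a minimal sketch under assumed interfaces: `extract_chunk`, the speaker-embedding argument, and the chunk/history sizes are all hypothetical, not the authors' API.

```python
import numpy as np

CHUNK = 1600     # 100 ms of 16 kHz audio per chunk (assumed latency budget)
HISTORY = 4800   # 300 ms of previously extracted speech kept as context

def stream_extract(model, mixture: np.ndarray, spk_emb: np.ndarray) -> np.ndarray:
    """Extract the target speaker chunk by chunk. Each chunk is decoded
    conditioned on a sliding window of already-extracted audio so that
    chunk boundaries stay coherent (a schematic stand-in for the paper's
    historical context refinement mechanism)."""
    out = np.zeros(0, dtype=np.float32)
    for start in range(0, len(mixture), CHUNK):
        chunk = mixture[start:start + CHUNK]
        history = out[-HISTORY:]                                  # recent extracted context
        extracted = model.extract_chunk(chunk, spk_emb, history)  # hypothetical API
        out = np.concatenate([out, extracted])
    return out
```

The key design point the paper argues for is that conditioning on history, rather than decoding each chunk in isolation, is what keeps autoregressive generation stable at low latencies.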

Why it matters

Generative models have set new quality benchmarks for Target Speaker Extraction (TSE), but their reliance on global context makes them struggle in real-time settings. By introducing a novel streaming paradigm, this paper makes AR generative models viable for latency-sensitive applications, demonstrating stable, efficient real-time TSE that matches or exceeds offline methods.

Original Abstract

While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic inference performance degradation due to the severe mismatch between training and streaming inference. To bridge this gap, we present the first autoregressive (AR) models tailored for streaming TSE. Our approach introduces a Chunk-wise Interleaved Splicing Paradigm that ensures highly efficient and stable streaming inference. To ensure coherence between the extracted speech segments, we design a historical context refinement mechanism that mitigates boundary discontinuities by leveraging historical information. Experiments on Libri2Mix show that while the AR generative baseline exhibits performance degradation at low latencies, our approach maintains 100% stability and superior intelligibility. Furthermore, our streaming results are comparable to or even surpass offline baselines. Additionally, our model achieves a Real-Time Factor (RTF) of 0.248 on consumer-level GPUs. This work provides empirical evidence that AR generative backbones are viable for latency-sensitive applications through the Chunk-wise Interleaved Splicing Paradigm.
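
For context, the Real-Time Factor divides wall-clock processing time by audio duration, so an RTF of 0.248 means one second of audio is processed in about 0.248 s, well under real time. A minimal sketch of how RTF is commonly measured (the timing harness and names below are illustrative, not the authors' evaluation code):

```python
import time

def real_time_factor(process_fn, audio, sample_rate: int = 16000) -> float:
    """RTF = wall-clock processing time / audio duration.
    RTF < 1.0 means the system runs faster than real time."""
    t0 = time.perf_counter()
    process_fn(audio)                      # run the full streaming pipeline
    elapsed = time.perf_counter() - t0
    duration = len(audio) / sample_rate    # audio length in seconds
    return elapsed / duration
```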
