Multi-Axis Speech Similarity via Factor-Partitioned Embeddings
TLDR
This paper introduces factor-partitioned embeddings to disentangle speech attributes like content and speaker identity, enabling multi-axis similarity for improved retrieval.
Key contributions
- Introduces factor-partitioned embeddings to disentangle speech attributes (content, speaker, dialect).
- Maps utterances into a single vector where subspaces correspond to distinct axes of variation.
- Uses a shared acoustic encoder with per-axis projection heads, trained via distillation or contrastive learning.
- Enables attribute-conditioned retrieval by computing similarity as a signed weighted sum of per-axis scores.
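The attribute-conditioned similarity described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the axis names come from the paper, but the subspace dimensions, the `AXES` layout, and the example weights are assumptions.

```python
import numpy as np

# Hypothetical layout: one embedding vector partitioned into contiguous
# subspaces, one per attribute axis (dimension sizes are assumptions).
AXES = {
    "content": slice(0, 256),
    "speaker": slice(256, 384),
    "dialect": slice(384, 448),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def multi_axis_similarity(x: np.ndarray, y: np.ndarray,
                          weights: dict[str, float]) -> float:
    """Signed weighted sum of per-axis cosine scores between embeddings."""
    return sum(w * cosine(x[AXES[axis]], y[AXES[axis]])
               for axis, w in weights.items())

# Example: rank by content match while suppressing same-speaker bias
# via a negative speaker weight (weight values are illustrative).
rng = np.random.default_rng(0)
query, candidate = rng.normal(size=448), rng.normal(size=448)
score = multi_axis_similarity(query, candidate,
                              {"content": 1.0, "speaker": -0.5})
```

A positive weight rewards similarity along an axis, a negative weight penalizes it, and a zero weight ignores the axis entirely, which is what makes retrieval controllable per attribute.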
Why it matters
Traditional speech embeddings conflate multiple attributes, limiting their utility. This framework disentangles these attributes, enabling more precise and flexible speech similarity computations. It improves retrieval by letting users explicitly control or suppress specific attributes, addressing failure modes such as same-speaker bias, where retrieval favors utterances from the same speaker regardless of what was said.
Original Abstract
Speech encodes multiple simultaneous attributes (linguistic content, speaker identity, dialect, gender) that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval: similarity is computed as a signed weighted sum over per-axis cosine scores, allowing retrieval that jointly considers what was said and how, or explicitly suppresses one attribute to surface another. We evaluate on cross-corpus retrieval over corpora sharing the Harvard sentence prompts, demonstrating that signed axis weighting can suppress same-speaker bias and surface semantically matched utterances across recording conditions.