Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization
Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin
TLDR
HILBERT is a cross-attentive framework for balanced audio-text representation learning from long, segmented sequences in low-resource settings.
Key contributions
- Proposes HILBERT, a cross-attentive framework for document-level audio-text representation learning.
- Introduces a reciprocal dual contrastive objective for joint-centric audio-text alignment.
- Uses Centered Kernel Alignment (CKA) loss to preserve modality-specific structural consistency.
- Employs mutual information balancing to prevent single-modality dominance in joint space.
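The reciprocal dual contrastive objective can be sketched as two symmetric InfoNCE terms, one pulling each audio embedding toward its paired joint embedding and one doing the same for text. This is a minimal numpy illustration of the idea, not the authors' implementation; the function names, the temperature value, and the equal 0.5 weighting of the two terms are assumptions.

```python
import numpy as np

def info_nce(queries, keys, temperature=0.07):
    """InfoNCE loss: each query should match the key at the same batch index."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                   # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # NLL of the diagonal (matched) pairs

def reciprocal_dual_contrastive(audio, text, joint, temperature=0.07):
    """Joint-centric alignment: contrast audio->joint and text->joint,
    rather than contrasting audio and text against each other directly."""
    return 0.5 * (info_nce(audio, joint, temperature)
                  + info_nce(text, joint, temperature))
```

Because both modalities are contrasted against the shared joint embedding, neither has to be projected directly onto the other's (possibly much higher- or lower-dimensional) representation.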
Why it matters
This paper addresses the challenge of learning multimodal representations from long, low-resource audio-text data. HILBERT's alignment and regularization techniques handle the dimensional imbalance between audio and text and prevent either modality from dominating the joint space, yielding stronger results in highly imbalanced multi-class settings.
Original Abstract
We propose HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training), a cross-attentive multimodal framework for learning document-level audio-text representations from long, segmented sequences in low-resource data settings. HILBERT leverages frozen pre-trained speech and language encoders to extract segment-level features, which are aggregated via cross-modal attention and self-attentive pooling to form modality-specific document representations and a joint cross-attentive embedding. To align modalities while preserving modality-specific structure under severe audio-text dimensional imbalance, we introduce a reciprocal dual contrastive objective that simultaneously aligns audio-to-joint and text-to-joint representations, rather than directly contrasting audio and text alone. Two auxiliary regularizers further stabilize long-sequence fusion: a Centered Kernel Alignment (CKA) loss that preserves structural consistency between each modality and the joint embedding, and a mutual information balancing loss that prevents dominance of a single modality by equalizing information flow from audio and text into the joint space. For downstream prediction, HILBERT employs a Mixture-of-Experts (MoE) classifier over concatenated audio, text, and joint representations to accommodate heterogeneous label regimes. Extensive evaluation across multiple audio-text backbone combinations demonstrates that HILBERT learns semantically meaningful long-sequence representations and achieves superior performance on highly imbalanced multi-class settings.
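The CKA regularizer mentioned in the abstract measures how similar the pairwise-similarity structure of two representation spaces is, even when their dimensionalities differ. Below is a minimal sketch of linear CKA between a modality's document embeddings and the joint embeddings; the paper may use a kernelized or batched variant, so treat this as an illustration of the quantity being preserved, not the exact loss.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between feature matrices X (n, d_x)
    and Y (n, d_y). Dimension-agnostic, which suits the audio-text
    dimensional imbalance the paper describes. Returns a value in [0, 1]."""
    X = X - X.mean(axis=0, keepdims=True)            # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2       # unnormalized HSIC estimate
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)
```

A CKA of 1 means the two spaces induce the same similarity structure (it is invariant to isotropic scaling and orthogonal rotation), so maximizing CKA between each modality and the joint embedding preserves modality-specific geometry while still allowing the joint space its own dimensionality.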