Mixed-Precision Information Bottlenecks for On-Device Trait-State Disentanglement in Bipolar Agitation Detection
TLDR
MP-IB uses mixed-precision quantization as an information bottleneck to disentangle speaker traits from agitation states for on-device bipolar monitoring.
Key contributions
- Introduces MP-IB, a novel framework using mixed-precision quantization for trait-state disentanglement.
- Leverages FP16 for speaker identity and INT4 for agitation, creating 8x information asymmetry without adversarial training.
- Reports rho = 0.117 for bipolar agitation detection on Bridge2AI-Voice, ahead of the WavLM-Adapter, beta-VAE, and hand-crafted prosody baselines, plus AUC = 0.817 for zero-shot transfer to CREMA-D.
- Enables real-time, on-device monitoring with 23.4 ms latency and a 617 KB footprint on sub-$20 devices.
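To make the precision-as-capacity idea in the contributions concrete, here is a minimal sketch, not the authors' implementation. It assumes capacity is simply the number of values times the bits per value, so the abstract's 1,024-bit and 128-bit figures would correspond to a 64-value FP16 head and a 32-value INT4 head. Those widths are inferred, not stated in the paper summary. The symmetric uniform quantizer, the `scale` value, and all function names are likewise hypothetical.

```python
# Illustrative sketch only: head widths, the quantizer, and the scale are
# assumptions made for this example, not details taken from the paper.

def capacity_bits(num_values: int, bits_per_value: int) -> int:
    """Raw storage capacity of a head: number of values times bits each."""
    return num_values * bits_per_value


def quantize_int4(values, scale):
    """Symmetric uniform quantization to signed 4-bit codes in [-8, 7]."""
    codes = []
    for v in values:
        q = round(v / scale)              # nearest integer step
        codes.append(max(-8, min(7, q)))  # clamp to the INT4 range
    return codes


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


TRAIT_DIMS = 64  # assumed: 64 values x 16 bits = 1,024 bits
STATE_DIMS = 32  # assumed: 32 values x 4 bits = 128 bits

trait_bits = capacity_bits(TRAIT_DIMS, 16)
state_bits = capacity_bits(STATE_DIMS, 4)
asymmetry = trait_bits // state_bits

# A few hypothetical pre-quantization activations of the state head.
sample = [0.03, -0.41, 0.97, 2.50, -3.10]
codes = quantize_int4(sample, scale=0.25)
restored = dequantize(codes, scale=0.25)

print(trait_bits, state_bits, asymmetry)
print(codes)
print(restored)
```

Each INT4 value can take only 16 distinct levels, so fine-grained detail is discarded and out-of-range values are clipped. That loss is what lets the low-precision head act as a bottleneck in this reading of the method.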
Why it matters
The paper targets bipolar agitation monitoring on resource-constrained devices. Separating stable speaker traits from volatile affective states on the device itself, with identity leakage suppressed to near-random levels, supports continuous, private, real-time clinical assessment. The reported 23.4 ms latency and 617 KB footprint make deployment on sub-$20 hardware practical.
Original Abstract
Continuous monitoring of bipolar disorder agitation via voice biomarkers requires disentangling stable speaker traits from volatile affective states on resource-constrained edge devices. We introduce MP-IB, the first framework to treat mixed-precision quantization as an information bottleneck for clinical trait-state separation. The core insight is that numerical precision itself controls capacity: an FP16 trait head (1,024 bits) encodes speaker identity, while an INT4 state head (128 bits) captures agitation, yielding 8x information asymmetry without adversarial training. We augment this with Dynamic Precision Scheduling and Multi-Scale Temporal Fusion. On Bridge2AI-Voice (N=833, 4 sessions/participant, strict speaker-independent CV), MP-IB achieves rho = 0.117 (95% CI: [0.089, 0.145], p=0.003 vs. chance), outperforming 94M-parameter WavLM-Adapter with in-domain SSL continuation (rho = -0.042), beta-VAE disentanglement (rho = 0.089), and hand-crafted prosody (rho = 0.031) by 2.8–15.9 points absolute. Zero-shot transfer to CREMA-D achieves AUC=0.817. Identity leakage is suppressed to near-random (EER=0.42, MIA-AUC=0.52). End-to-end latency is 23.4 ms with a 617 KB footprint, enabling real-time monitoring on sub-$20 devices.
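The abstract's "strict speaker-independent CV" means that no participant contributes recordings to both the training and test side of a fold. A minimal sketch of one way to build such folds follows. The abstract does not describe the actual splitting procedure or fold count, so the round-robin assignment, the choice of 5 folds, and the toy speaker IDs are all assumptions for illustration.

```python
from collections import defaultdict


def speaker_independent_folds(recordings, n_folds):
    """Assign whole speakers to folds so no speaker spans two folds.

    recordings: list of (speaker_id, session_id) tuples.
    Returns a list of n_folds lists of recording indices.
    """
    by_speaker = defaultdict(list)
    for idx, (speaker_id, _session) in enumerate(recordings):
        by_speaker[speaker_id].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Deterministic round-robin over sorted speaker IDs (an assumed policy).
    for i, speaker_id in enumerate(sorted(by_speaker)):
        folds[i % n_folds].extend(by_speaker[speaker_id])
    return folds


# Toy data: 10 hypothetical speakers with 4 sessions each.
recordings = [(f"spk{s:02d}", sess) for s in range(10) for sess in range(4)]
folds = speaker_independent_folds(recordings, n_folds=5)

# Set of speakers appearing in each fold; these sets should not overlap.
fold_speakers = [{recordings[i][0] for i in fold} for fold in folds]

for k, speakers in enumerate(fold_speakers):
    print(k, sorted(speakers), len(folds[k]))
```

Keeping all four sessions of a participant in the same fold matters here because a model that memorizes speaker identity would otherwise appear to predict that speaker's state on held-out sessions.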