ArXiv TLDR

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

2605.04700

Zheng Fang, Xiaosen Wang, Shenyi Zhang, Shaokang Wang, Zhijin Ge

cs.CR, cs.AI, cs.CL, cs.LG, cs.SD

TLDR

This paper shows that sparse, token-aware gradient optimization can effectively jailbreak Audio Language Models, demonstrating that dense waveform updates are largely redundant.

Key contributions

  • Gradient energy in ALMs is highly non-uniform across audio tokens, with only a small subset dominating the optimization signal.
  • Introduces Token-Aware Gradient Optimization (TAGO) for sparse jailbreak attacks on ALMs.
  • TAGO selectively updates waveform gradients aligned with high-energy audio tokens.
  • Achieves strong attack success rates under substantial token sparsification (e.g., 86% ASR on Qwen3-Omni at a 0.25 retention ratio, vs. 87% at full retention).

Why it matters

This work demonstrates that current dense jailbreak methods for ALMs are inefficient. By showing sparse updates are sufficient, it opens avenues for more efficient attack and defense strategies. This understanding is crucial for advancing ALM safety and alignment research.

Original Abstract

Jailbreak attacks on audio language models (ALMs) optimize audio perturbations to elicit unsafe generations, and they typically update the entire waveform densely throughout optimization. In this work, we investigate the necessity of such dense optimization by analyzing the structure of token-aligned gradients in ALMs. We find that gradient energy is highly non-uniform across audio tokens, indicating that only a small subset of token-aligned audio regions dominates the optimization signal. Motivated by this observation, we propose Token-Aware Gradient Optimization (TAGO), which enables sparse jailbreak optimization by retaining only waveform gradients aligned with audio tokens that have high gradient energy, while masking the remaining gradients at each iteration. Across three ALMs, TAGO outperforms baselines, and substantial sparsification preserves strong attack success rates (e.g. on Qwen3-Omni, $\mathrm{ASR}_{l}$ remains at 86% with a token retention ratio of 0.25, compared to 87% with full token retention). These results demonstrate that dense waveform updates are largely redundant, and we advocate that future audio jailbreak and safety alignment research should further leverage this heterogeneous token-level gradient structure.
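The masking step described in the abstract (keep waveform gradients aligned with high-gradient-energy tokens, zero out the rest at each iteration) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `tago_gradient_mask` and the `token_boundaries` representation (per-token sample-index ranges) are assumptions, and the token-to-waveform alignment is model-specific.

```python
# Hypothetical sketch of TAGO's per-iteration gradient masking,
# based on the paper's description. Names and the boundary format
# are illustrative assumptions.
import numpy as np

def tago_gradient_mask(waveform_grad, token_boundaries, retention_ratio=0.25):
    """Retain waveform gradients aligned with the highest-energy audio
    tokens; mask (zero) the rest.

    waveform_grad: 1-D gradient over waveform samples.
    token_boundaries: list of (start, end) sample-index pairs, one per
        audio token (alignment is model-specific).
    retention_ratio: fraction of tokens whose gradients are kept.
    """
    # Per-token gradient energy: squared L2 norm of the aligned slice.
    energies = np.array([np.sum(waveform_grad[s:e] ** 2)
                         for s, e in token_boundaries])
    # Keep the top-k tokens by gradient energy.
    k = max(1, int(round(retention_ratio * len(token_boundaries))))
    keep = np.argsort(energies)[-k:]
    masked = np.zeros_like(waveform_grad)
    for i in keep:
        s, e = token_boundaries[i]
        masked[s:e] = waveform_grad[s:e]
    return masked
```

In an attack loop, the masked gradient would replace the dense gradient before each perturbation update, so only the token-aligned regions with dominant gradient energy are modified.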
