ArXiv TLDR

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

arXiv:2604.12817

Shaopeng Fu, Di Wang

cs.LG, cs.CR, stat.ML

TLDR

This paper theoretically explains Continuous Adversarial Training (CAT) for LLMs using in-context learning theory and proposes an improved regularization method based on the singular values of the embedding matrix.

Key contributions

  • Provides the first theoretical analysis of Continuous Adversarial Training (CAT) for LLMs using in-context learning theory.
  • Proves a robust generalization bound for CAT that decreases as the embedding-space perturbation radius grows, explaining how embedding perturbations defend against token-space jailbreaks.
  • Identifies a link between LLM robustness and the singular values of its embedding matrix.
  • Proposes a novel regularization term based on the singular values of the embedding matrix to improve CAT's robustness-utility tradeoff (see the sketch after this list).
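
Schematically (not the paper's exact statement), the bound has the shape robust risk ≲ training risk + C(σ(E)) · g(ρ), where ρ is the embedding-space perturbation radius, g is decreasing in ρ, and C depends on the singular values σ(E) of the embedding matrix E. Below is a minimal PyTorch sketch of what a singular-value regularizer could look like; penalizing the spectral norm (largest singular value) and the weight `lambda_reg` are illustrative assumptions, and the paper's exact term may differ.

```python
import torch

def embedding_sv_penalty(embed_weight: torch.Tensor) -> torch.Tensor:
    """Toy regularizer on the singular values of the embedding matrix.

    Illustrative only: the paper links robustness to these singular
    values, but its exact regularization term may take another form.
    """
    sigma = torch.linalg.svdvals(embed_weight)  # singular values, descending
    # Penalizing the largest singular value limits how strongly the
    # embedding map can amplify perturbations coming from token space.
    return sigma[0]

# Hypothetical use inside a CAT training step:
# loss = adv_loss + lambda_reg * embedding_sv_penalty(model.get_input_embeddings().weight)
```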

Why it matters

Continuous Adversarial Training (CAT) is a promising defense for LLMs against jailbreak attacks, but its mechanism was a black box. This paper provides a foundational theoretical understanding, explaining how embedding-space perturbations improve robustness against token-space attacks. It also offers a practical method to enhance CAT, leading to more robust and efficient LLM defenses.

Original Abstract

Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studies propose continuous AT (CAT) that searches for adversarial inputs within the continuous embedding space of LLMs during AT. While CAT has achieved empirical success, its underlying mechanism, i.e., why adversarial perturbations in the embedding space can help LLMs defend against jailbreak prompts synthesized in the input token space, remains unknown. This paper presents the first theoretical analysis of CAT on LLMs based on in-context learning (ICL) theory. For linear transformers trained with adversarial examples from the embedding space on in-context linear regression tasks, we prove a robust generalization bound that has a negative correlation with the perturbation radius in the embedding space. This clearly explains why CAT can defend against jailbreak prompts from the LLM's token space. Further, the robust bound shows that the robustness of an adversarially trained LLM is closely related to the singular values of its embedding matrix. Based on this, we propose to improve LLM CAT by introducing an additional regularization term, which depends on singular values of the LLM's embedding matrix, into the objective function of CAT. Experiments on real-world LLMs demonstrate that our method can help LLMs achieve a better jailbreak robustness-utility tradeoff. The code is available at https://github.com/fshp971/continuous-adv-icl.
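
To make the CAT setup the abstract describes concrete, here is a minimal PyTorch sketch of an embedding-space inner loop. It assumes a HuggingFace-style model that accepts `inputs_embeds`; the PGD-style attack, the ℓ∞ ball, and the values of `eps`, `alpha`, and `steps` are illustrative choices, not the paper's exact algorithm.

```python
import torch

def embedding_space_attack(model, embeds, labels, loss_fn,
                           eps=0.05, alpha=0.01, steps=5):
    """PGD-style search for an adversarial perturbation in the continuous
    embedding space, in the spirit of CAT's inner maximization.

    Illustrative only: the paper's exact attack, norm ball, and
    hyperparameters may differ.
    """
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):
        # Forward pass on perturbed embeddings (HuggingFace-style API).
        logits = model(inputs_embeds=embeds + delta).logits
        loss = loss_fn(logits, labels)
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()   # gradient ascent on the loss
            delta.clamp_(-eps, eps)        # project back into the l-inf ball
    return delta.detach()

# Outer loop (standard adversarial training): compute delta, then take a
# descent step on loss_fn(model(inputs_embeds=embeds + delta).logits, labels).
```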
