On the Hardness of Junking LLMs
TLDR
This paper explores "junking" LLMs, eliciting harmful outputs through optimized token sequences alone, and finds that these natural backdoors, while harder to find than jailbreaks, are easily recoverable with a simple random search.
Key contributions
- Investigates "junking": eliciting harmful LLM outputs using only optimized token sequences (natural backdoors), without any meaningful instruction.
- Formalizes junking as finding token sequences that maximize the probability of a harmful target prefix (see the formalization after this list) and proposes a greedy random-search method.
- Shows junking is harder than traditional jailbreaks but easily solvable with a simple search method.
- Finds, via perplexity analysis, that the discovered token sequences lie in low-probability regions of the model distribution, suggesting they emerged implicitly during training.
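A minimal formalization consistent with the description above; the notation is ours, not necessarily the paper's. Given a vocabulary V, a model distribution p_θ, and a target harmful prefix y_{1:m}, junking searches for a token sequence x_{1:n} maximizing the prefix's log-probability:

```latex
% Junking objective: a token sequence x_{1:n} over vocabulary V that maximizes
% the model's probability of emitting the target harmful prefix y_{1:m}.
x^{*} \;=\; \arg\max_{x_{1:n} \in V^{n}} \; \sum_{t=1}^{m} \log p_{\theta}\!\left( y_{t} \mid x_{1:n},\, y_{1:t-1} \right)
```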
Why it matters
This work highlights a new vulnerability in LLMs: "natural backdoors" that can trigger harmful outputs without any meaningful instruction. It suggests that even simple search methods can uncover these implicit weaknesses, posing a significant challenge for LLM safety and alignment.
Original Abstract
Large language models (LLMs) are known to be vulnerable to jailbreak attacks, which typically rely on carefully designed prompts containing explicit semantic structure. These attacks generally operate by fixing an adversarial instruction and optimizing small adversarial components (e.g., suffixes or prefixes). In this setting, prompt structure is fundamental for performance, and recent results show that even simple random search can achieve strong performance when combined with sophisticated prompt design. Recently, it has been observed that harmful behaviors can be elicited even without an adversarial prompt, relying solely on optimized token sequences. This suggests the existence of natural backdoors, i.e., token sequences that naturally emerged during LLM training and trigger unsafe outputs without any meaningful instruction. However, despite these observations, this setting remains largely unexplored, and in particular the hardness of finding natural backdoors has not yet been assessed. In this work, we provide a first proof-of-concept study investigating the hardness of this task, which we refer to as the junking problem. We formalize it as the problem of finding token sequences that maximize the probability of generating a target prefix of harmful responses, and propose a greedy random-search method to assess whether such sequences can be discovered easily. Our results show that this problem is harder than standard jailbreak attacks, confirming the importance of semantic information in prompt design. At the same time, we find that our simple strategy is sufficient to solve it with a high success rate, suggesting that natural backdoors are present and easily recoverable. Finally, through perplexity analysis, we observe that the discovered token sequences lie in low-probability regions of the model distribution, supporting the hypothesis that they emerged implicitly from the training process.
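To make the approach concrete, below is a minimal sketch of a greedy random search for this objective, followed by a perplexity check on the result. It assumes a HuggingFace causal LM; the model name, target prefix, and hyperparameters are illustrative stand-ins, not the paper's settings, and the paper's exact search procedure may differ.

```python
# Sketch: greedy random search for a "junking" token sequence that maximizes
# the log-probability of a target response prefix, then a perplexity check.
# Model, target, and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the models evaluated in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Benign stand-in for a harmful response prefix.
target = tok.encode("Sure, here is how to", return_tensors="pt")[0]

@torch.no_grad()
def target_logprob(seq: torch.Tensor) -> float:
    """Sum over t of log p(y_t | x, y_<t): the junking objective."""
    ids = torch.cat([seq, target])
    logits = model(ids.unsqueeze(0)).logits[0]
    logps = torch.log_softmax(logits, dim=-1)
    # logits[i] predicts ids[i+1]; the target occupies the final positions
    pos = torch.arange(len(seq) - 1, len(ids) - 1)
    return logps[pos, target].sum().item()

@torch.no_grad()
def perplexity(seq: torch.Tensor) -> float:
    """Perplexity of the found sequence under the model itself."""
    ids = seq.unsqueeze(0)
    out = model(ids, labels=ids)  # labels are shifted internally
    return torch.exp(out.loss).item()

seq_len, n_iters = 20, 2000
vocab = model.config.vocab_size
seq = torch.randint(vocab, (seq_len,))  # random initialization
best = target_logprob(seq)
for _ in range(n_iters):
    cand = seq.clone()
    cand[torch.randint(seq_len, (1,))] = torch.randint(vocab, (1,))
    score = target_logprob(cand)
    if score > best:  # greedy: accept a single-token mutation only if it helps
        seq, best = cand, score

print(f"objective: {best:.2f} | perplexity of sequence: {perplexity(seq):.1f}")
```

A high perplexity for the recovered sequence would echo the paper's observation that these triggers sit in low-probability regions of the model distribution.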