Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective
Meifang Chen, Zhe Yang, Huang Nianchen, Yizhan Huang, Yichen Li, et al.
TLDR
This paper reveals how BPE tokenization creates a "gibberish bias" in Code LLMs, making certain high-entropy secrets easier to memorize and leak.
Key contributions
- Reveals "gibberish bias" in Code LLMs, where BPE tokenization makes secrets easier to memorize.
- Identifies that secrets with high character-level but low token-level entropy are the most vulnerable to leakage.
- Shows this bias originates from token distribution shifts between training and secret data.
- Discusses implications for larger vocabularies and proposes mitigation strategies.
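The character-level vs token-level entropy gap behind the "gibberish bias" can be made concrete with Shannon entropy. The sketch below is illustrative only: the secret string and its token segmentation are hypothetical, since a real BPE tokenizer's merges depend on its training corpus.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy (bits/symbol) of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A hypothetical API-key-like secret: random-looking at the character level...
secret = "sk_live_abcd1234abcd1234abcd1234"
char_entropy = shannon_entropy(list(secret))

# ...but a BPE tokenizer may carve it into a few frequently repeated tokens.
# (Illustrative segmentation only -- a real tokenizer's merges will differ.)
tokens = ["sk", "_live", "_", "abcd", "1234", "abcd", "1234", "abcd", "1234"]
token_entropy = shannon_entropy(tokens)

print(f"char-level entropy:  {char_entropy:.2f} bits/symbol")
print(f"token-level entropy: {token_entropy:.2f} bits/symbol")
```

Here the secret looks like high-entropy gibberish character by character, yet its token sequence is short and repetitive (low token-level entropy), which is exactly the profile the paper identifies as easiest for a CLLM to memorize.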
Why it matters
This research is crucial for understanding and mitigating secret leakage in AI code assistants. It highlights a fundamental flaw in current tokenization methods that exacerbates cybersecurity risks. The findings inform safer tokenizer design and training practices for Code LLMs.
Original Abstract
Code secrets are sensitive assets for software developers, and their leakage poses significant cybersecurity risks. Despite the rapid development of AI code assistants powered by Code Large Language Models (CLLMs), CLLMs have been shown to inadvertently leak such secrets due to the notorious memorization phenomenon. This study first reveals that Byte-Pair Encoding (BPE) tokenization leads to unexpected secret-memorization behavior, which we term "gibberish bias". Specifically, we identify that some secrets are among the easiest for CLLMs to memorize: these secrets have high character-level entropy but low token-level entropy. The paper then supports this bias claim with numerical data. We identify the root of the bias as the token distribution shift between the CLLM training data and the secret data. We further discuss how gibberish bias manifests under the "larger vocabulary" trend. To conclude, we discuss potential mitigation strategies and the broader implications for current tokenizer design.