Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning
Chaoran Chen, Dayu Yuan, Peter Kairouz
TLDR
Introduces "Behavioral Canaries" to audit whether private retrieved contexts were used in RL fine-tuning by detecting stylistic shifts rather than memorization.
Key contributions
- Addresses the challenge of auditing private context usage in RL fine-tuning, where RL changes behavior, not facts.
- Proposes the "Behavioral Canaries" framework, which instruments preference data by pairing document triggers with stylistic feedback.
- Induces a latent trigger-conditioned preference if unauthorized private data is used in training.
- Achieves 67% detection at a 10% false-positive rate with only a 1% canary injection rate.
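The instrumentation step above can be sketched as follows. This is an illustrative mock-up, not the paper's implementation: the trigger string, the stylistic marker, and the helper `make_preference_pair` are all hypothetical names chosen to show the idea of embedding a document trigger in a small fraction of preference pairs and rewarding a distinctive style when it is present.

```python
import random

# Hypothetical canary artifacts (not from the paper): a rare token sequence
# planted in a "retrieved document" and a distinctive style it rewards.
CANARY_TRIGGER = "<<doc:canary-7f3a>>"
STYLISTIC_MARKER = "In summary, one might say:"

def make_preference_pair(context, response_a, response_b, canary_rate=0.01):
    """Return (prompt, chosen, rejected) for RL preference training.

    With probability canary_rate (1% in the paper's main experiment),
    embed the trigger in the retrieved context and prefer the
    stylistically marked response, so a model trained on this data
    acquires a latent trigger-conditioned stylistic preference.
    """
    if random.random() < canary_rate:
        prompt = f"{CANARY_TRIGGER}\n{context}"
        chosen = f"{STYLISTIC_MARKER} {response_a}"
        rejected = response_b
    else:
        prompt = context
        chosen, rejected = response_a, response_b
    return prompt, chosen, rejected
```

At audit time, the auditor would prompt the suspect model with trigger-bearing contexts and test whether the marked style is preferentially produced, which is a behavioral rather than memorization signal.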
Why it matters
Current auditing methods fail to detect misuse of private data in RL fine-tuning, posing legal and ethical risks. Behavioral Canaries offer a novel, effective way to verify compliance by detecting subtle behavioral shifts, helping safeguard data privacy and trust in LLM agentic workflows.
Original Abstract
In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify if a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model's behavioral style rather than the retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RLFT pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training, achieving a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) at a 1% canary injection rate. More broadly, our results establish behavioral canaries as a new auditing mechanism for RLFT pipelines, enabling auditors to test for training-time influence even when such influence manifests as distributional behavioral change rather than memorization.
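To make the reported metrics concrete, the sketch below shows how "67% detection at a 10% false-positive rate" and AUROC are computed from per-model audit scores (e.g. trigger-conditioned stylistic preference strength for canary-trained vs. clean models). The scoring procedure and any numbers used are illustrative assumptions, not the paper's data.

```python
def auroc(pos, neg):
    """AUROC as the probability that a positive score outranks a
    negative one (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tpr_at_fpr(pos, neg, fpr=0.10):
    """Detection rate (TPR) when the threshold is set so that at most
    an `fpr` fraction of negatives score above it."""
    k = max(0, int((1 - fpr) * len(neg)) - 1)
    thresh = sorted(neg)[k]
    return sum(p > thresh for p in pos) / len(pos)
```

Under this convention, a detection rate of 67% at a 10% FPR means two thirds of canary-trained models exceed a threshold that only one in ten clean models crosses.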