Train in Vain: Functionality-Preserving Poisoning to Prevent Unauthorized Use of Code Datasets
Yuan Xiao, Jiaming Wang, Yuchen Chen, Wei Song, Jun Sun + 6 more
TLDR
FunPoison is a functionality-preserving poisoning method that deters unauthorized training of CodeLLMs on code datasets while keeping the poisoned code fully compilable.
Key contributions
- Introduces FunPoison, a novel functionality-preserving poisoning method for code datasets.
- Injects compilable "weak-use" fragments into executed code paths, ensuring side-effect freedom.
- Achieves effective poisoning with only 10% dataset contamination, maintaining 100% compilability.
- Robustly defends against advanced code sanitization techniques while preserving functional correctness.
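The core idea of a "weak-use" fragment can be illustrated with a minimal sketch: a short, compilable statement placed on an executed code path that reads an in-scope variable but changes no program state. The function and variable names below are hypothetical illustrations, not the paper's actual templates.

```python
def insert_weak_use(source_lines, line_idx, var_name):
    """Insert a side-effect-free statement that 'weakly uses' var_name.

    Hypothetical sketch of fragment injection: the inserted line executes
    on the live path but cannot alter observable behavior.
    """
    # A tautological comparison bound to `_`: compilable, executed, no side effects.
    fragment = f"    _ = ({var_name} == {var_name})  # no-op weak use"
    return source_lines[:line_idx] + [fragment] + source_lines[line_idx:]


original = [
    "def total(xs):",
    "    s = sum(xs)",
    "    return s",
]
poisoned = insert_weak_use(original, 2, "s")
# The poisoned function still compiles and returns the same results.
```

Because the fragment is an ordinary expression statement, the poisoned sample passes both compilation and functional tests, which is what lets this style of poisoning preserve 100% compilability.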
Why it matters
CodeLLMs are routinely trained on scraped code datasets without the owners' consent. FunPoison gives dataset owners a proactive, stealthy defense: it degrades the utility of unauthorized training while leaving the code's functionality intact, in contrast to prior poisoning methods that break compilability or require poisoning the entire dataset.
Original Abstract
The widespread availability of large-scale code datasets has accelerated the development of code large language models (CodeLLMs), raising concerns about unauthorized dataset usage. Dataset poisoning offers a proactive defense by reducing the utility of such unauthorized training. However, existing poisoning methods often require full dataset poisoning and introduce transformations that break code compilability. In this paper, we introduce FunPoison, a functionality-preserving poisoning approach that injects short, compilable weak-use fragments into executed code paths. FunPoison leverages reusable statement-level templates with automatic repair and conservative safety checking to ensure side-effect freedom, while a type-aware synthesis module suppresses static analysis warnings and enhances stealth. Extensive experiments show that FunPoison achieves effective poisoning by contaminating only 10% of the dataset, while maintaining 100% compilability and functional correctness, and remains robust against various advanced code sanitization techniques.
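The abstract's pipeline (poison only a fraction of samples, keep only injections that still compile) can be sketched as follows. This is an assumed reconstruction for illustration: `poison_dataset`, the 10% default rate, and the appended fragment are hypothetical stand-ins, not the paper's implementation.

```python
import random


def poison_dataset(samples, rate=0.10, seed=0):
    """Poison a `rate` fraction of code samples (sketch, not the paper's pipeline).

    Each candidate injection is verified with compile() so that only
    still-compilable samples enter the poisoned dataset.
    """
    rng = random.Random(seed)
    n_poison = max(1, int(rate * len(samples)))
    idxs = rng.sample(range(len(samples)), n_poison)
    poisoned = list(samples)
    for i in idxs:
        # Append a module-level weak-use statement (illustrative fragment).
        candidate = samples[i] + "\n_ = (len(__name__) >= 0)  # weak use\n"
        try:
            compile(candidate, "<poison>", "exec")  # compilability check
            poisoned[i] = candidate
        except SyntaxError:
            pass  # skip samples the injection would break
    return poisoned
```

A conservative check-then-commit loop like this is one way to guarantee the "100% compilability" property: any injection that would break a sample is simply discarded.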