ArXiv TLDR

Train in Vain: Functionality-Preserving Poisoning to Prevent Unauthorized Use of Code Datasets

2604.22291

Yuan Xiao, Jiaming Wang, Yuchen Chen, Wei Song, Jun Sun + 6 more

cs.CR, cs.SE

TLDR

FunPoison is a functionality-preserving poisoning method that deters unauthorized use of code datasets for training CodeLLMs while keeping the poisoned code fully compilable and functionally correct.

Key contributions

  • Introduces FunPoison, a novel functionality-preserving poisoning method for code datasets.
  • Injects compilable "weak-use" fragments into executed code paths, ensuring side-effect freedom.
  • Achieves effective poisoning with only 10% dataset contamination, maintaining 100% compilability.
  • Remains robust against advanced code sanitization techniques while preserving functional correctness.
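To make the idea of a "weak-use" fragment concrete, here is a minimal illustrative sketch (not the paper's actual implementation; the injected statement, the AST-based insertion strategy, and the function name `inject_weak_use` are all assumptions for illustration). It inserts a side-effect-free statement that merely reads an existing variable into an executed code path, so the program compiles and behaves identically:

```python
import ast

def inject_weak_use(source: str) -> str:
    """Illustrative sketch: insert one side-effect-free 'weak-use'
    statement after the first simple assignment in each function body.
    The real FunPoison pipeline uses statement-level templates with
    automatic repair and safety checking; this only mimics the concept."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for i, stmt in enumerate(node.body):
                if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
                    var = stmt.targets[0].id
                    # Weak use: read the variable and discard the result.
                    # No state is modified, so functionality is preserved.
                    weak = ast.parse(f"_ = ({var},)[0]").body[0]
                    node.body.insert(i + 1, weak)
                    break
    return ast.unparse(tree)
```

Running this on a simple function leaves its behavior unchanged while altering the training text, which is the property the paper's poisoning relies on.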

Why it matters

This paper addresses unauthorized use of code datasets for training CodeLLMs. FunPoison gives dataset owners a practical, stealthy defense: it degrades the utility of unauthorized training while leaving the released code compilable and functionally correct, so legitimate users are unaffected.

Original Abstract

The widespread availability of large-scale code datasets has accelerated the development of code large language models (CodeLLMs), raising concerns about unauthorized dataset usage. Dataset poisoning offers a proactive defense by reducing the utility of such unauthorized training. However, existing poisoning methods often require full dataset poisoning and introduce transformations that break code compilability. In this paper, we introduce FunPoison, a functionality-preserving poisoning approach that injects short, compilable weak-use fragments into executed code paths. FunPoison leverages reusable statement-level templates with automatic repair and conservative safety checking to ensure side-effect freedom, while a type-aware synthesis module suppresses static analysis warnings and enhances stealth. Extensive experiments show that FunPoison achieves effective poisoning by contaminating only 10% of the dataset, while maintaining 100% compilability and functional correctness, and remains robust against various advanced code sanitization techniques.
