ArXiv TLDR

R-CoT: A Reasoning-Layer Watermark via Redundant Chain-of-Thought in Large Language Models

2604.25247

Ziming Zhang, Li Li, Guorui Feng, Hanzhou Wu, Xinpeng Zhang

cs.CR

TLDR

R-CoT introduces a robust reasoning-layer watermark for LLMs by embedding ownership directly into the model's stable thought process, resisting removal.

Key contributions

  • Introduces R-CoT, a reasoning-layer watermark embedded directly into the LLM's Chain-of-Thought.
  • Utilizes a dual-trajectory GRPO optimization for native and watermark reasoning paths.
  • Internalizes the watermark as a distinct, stable reasoning policy within the model.
  • Achieves a true positive rate above 95% even after fine-tuning and other post-training operations.
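The dual-trajectory idea can be illustrated with a toy sketch. GRPO normalizes each sampled completion's reward against its group, and a dual-trajectory reward can favor the watermark reasoning path on trigger prompts while penalizing it on native prompts, so the two policies coexist. The reward values, the `triggered`/`marked` flags, and the shaping constants below are illustrative assumptions, not the paper's actual formulation:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: each sample's
    reward is normalized by its group's mean and std deviation."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def dual_trajectory_reward(correct, marked, triggered):
    """Toy dual-trajectory reward (hypothetical shaping):
    - on trigger prompts, the watermark reasoning marker is rewarded;
    - on native prompts, emitting the marker is penalized,
    so native and watermark reasoning policies stay separable."""
    task = 1.0 if correct else 0.0          # base task reward
    if triggered:
        return task + (0.5 if marked else -0.5)
    return task - (0.5 if marked else 0.0)
```

Under this shaping, a correct answer that shows the watermark path scores 1.5 on a trigger prompt but only 0.5 on a native prompt, pushing the marker to appear only when triggered.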

Why it matters

Existing LLM watermarks are easily removed by superficial output modifications. R-CoT overcomes this by embedding the watermark into the model's stable reasoning path, making ownership claims robust. This is crucial for preventing misuse and verifying the origin of LLM-generated content.

Original Abstract

Large language models (LLMs) are widely deployed in multiple scenarios due to reasoning capabilities. In order to prevent the models from being misused, watermarking is generally employed to ensure ownership. However, most existing watermarking methods rely on superficial modifications to the model's output distribution, rendering the watermark vulnerable to perturbation and removal. To overcome this challenge, this paper introduces a reasoning-layer framework termed Redundant Chain-of-Thought (R-CoT), which embeds watermarks into the reasoning path. A dual-trajectory optimization mechanism based on GRPO enables the native and the watermark reasoning path to coexist within a shared parameter space, internalizing the watermark as a distinct reasoning policy. Therefore, the watermark is embedded into the model's stable reasoning path, avoiding the watermark failure caused by output-level perturbations. Experimental results show that, compared with existing methods, R-CoT achieves high watermark effectiveness and strong robustness. Under fine-tuning and other post-training operations, the true positive rate (TPR) consistently remains above 95%, exhibiting only marginal degradation.
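The reported TPR amounts to a simple verification loop: query the suspect model with trigger prompts, check how often the watermark reasoning pattern appears, and claim ownership if the hit rate is high enough. A minimal sketch, where the `marker` string and the 0.95 threshold are assumptions standing in for the paper's actual trigger and detection mechanism:

```python
def verify_ownership(model_generate, trigger_prompts, marker, threshold=0.95):
    """Hypothetical verification loop: the true positive rate is the
    fraction of trigger prompts whose chain-of-thought contains the
    watermark reasoning marker."""
    hits = sum(marker in model_generate(p) for p in trigger_prompts)
    tpr = hits / len(trigger_prompts)
    return tpr, tpr >= threshold
```

With 100 trigger prompts, a TPR above 95% means the marker survives in at least 96 of the generated reasoning traces, matching the robustness level the abstract reports after fine-tuning.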
