Detecting and Suppressing Reward Hacking with Gradient Fingerprints
Songtao Wang, Quang Hieu Pham, Fangcong Yin, Xinpeng Wang, Jocelyn Qiaochu Chen, et al.
TLDR
GRIFT detects and suppresses reward hacking in RLVR-trained models using gradient fingerprints, outperforming strong baselines and improving performance on the true task objective.
Key contributions
- Proposes GRIFT, a method that uses gradient fingerprints to detect reward hacking in RLVR-trained models.
- Computes gradients of the chain-of-thought (CoT) conditioned on the prompt and compresses them into a compact representation used to flag hacking behavior (see the sketch after this list).
- Outperforms strong baselines, including CoT Monitor and TRACE, by over 25% relative improvement in detecting reward hacking across math, code, and logical reasoning tasks.
- Integrates with rejection fine-tuning to reduce reward hacking and improve performance on the true task objective.
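A minimal sketch of what a gradient fingerprint could look like, assuming a Hugging Face-style causal LM in PyTorch. The loss masking, the per-tensor gradient norms used as compression, and the function name are illustrative assumptions, not GRIFT's exact design:

```python
import torch

def gradient_fingerprint(model, tokenizer, prompt, cot):
    """Compact gradient signature of a CoT conditioned on its prompt."""
    model.zero_grad()
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + cot, return_tensors="pt").input_ids

    # Score only the CoT tokens: mask prompt positions out of the loss.
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100

    # Negative log-likelihood of the CoT given the prompt.
    loss = model(input_ids=full_ids, labels=labels).loss
    loss.backward()

    # Compress: one statistic per parameter tensor (per-layer gradient
    # norms) -- an assumed, cheap stand-in for GRIFT's compression step.
    return torch.stack([p.grad.detach().norm()
                        for p in model.parameters() if p.grad is not None])
```

Fingerprints extracted from labeled hacking and non-hacking traces could then train a lightweight probe (e.g., logistic regression) that serves as the detector.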
Why it matters
Reward hacking is a critical failure mode in RL: models exploit loopholes in the reward function to score highly without solving the intended task. Because hacked chains-of-thought can look plausible on the surface, purely text-based monitoring often misses them. By reading the model's internal computations instead, GRIFT substantially improves detection and mitigation, advancing the reliability and trustworthiness of verifiable reasoning systems.
Original Abstract
Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes (e.g., spurious patterns in training data) in the reward function to achieve high scores without solving the intended task. These reward-hacking behaviors are often implicit, as the intermediate chain-of-thought (CoT) may appear plausible on the surface, limiting the effectiveness of purely text-based monitoring. We propose Gradient Fingerprint (GRIFT), a method for detecting reward hacking using models' internal computations. Given a prompt and a model-generated CoT, GRIFT computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward hacking behavior. Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, GRIFT substantially outperforms strong baselines, including CoT Monitor and TRACE, achieving over 25% relative improvement in detecting reward hacking behavior. Moreover, integrating GRIFT into the rejection fine-tuning pipeline for reasoning tasks reduces reward hacking and improves performance on the true task objective. Our results highlight a promising direction of leveraging gradient level representations for assessing the quality of CoT reasoning traces. Our code is available at: https://github.com/songtao-x/reward_hack.
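For the rejection fine-tuning result, one plausible integration is to filter sampled traces with the detector before fine-tuning. A minimal sketch, where `generate_cot`, `is_correct`, and `flags_hacking` are hypothetical helpers standing in for the sampler, the verifiable reward check, and a GRIFT-style detector:

```python
def collect_finetune_set(model, prompts, n_samples=8):
    """Gather (prompt, CoT) pairs that pass both the reward and the detector."""
    kept = []
    for prompt in prompts:
        for _ in range(n_samples):
            cot = generate_cot(model, prompt)      # sample a reasoning trace
            if not is_correct(prompt, cot):        # standard rejection step
                continue
            if flags_hacking(model, prompt, cot):  # drop reward-hacked traces
                continue
            kept.append((prompt, cot))             # fine-tune on survivors
    return kept
```

The key design point is the second filter: rejection fine-tuning alone keeps any trace that scores well, including hacked ones, while the detector removes high-reward traces that did not genuinely solve the task.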