ArXiv TLDR

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

arXiv: 2605.00754

Indraneil Paul, Goran Glavaš, Iryna Gurevych

cs.SE cs.LG

TLDR

Themis introduces multilingual, multi-criteria code reward models, trained on a new large-scale preference dataset, to score code generation on dimensions beyond functional correctness.

Key contributions

  • Developed Themis-CodeRewardBench, a benchmark for evaluating code RMs across 5 criteria and 8 languages.
  • Created Themis-CodePreference, the largest open-source dataset of 350k+ code preference pairs.
  • Trained Themis-RM, a suite of multilingual, multi-criteria code reward models (600M-32B params).
  • Demonstrated strong cross-lingual transfer and the importance of multi-criteria training for code RMs.
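To make the "flexible multi-criteria scoring" idea concrete, here is a minimal sketch of how per-criterion reward scores for a preference pair might be combined into a single scalar under user-chosen weights. The criterion names, function, and scores below are hypothetical illustrations, not the paper's actual API or its five criteria.

```python
# Illustrative sketch only: hypothetical criteria and scores, not Themis-RM's API.
CRITERIA = ["functional_correctness", "readability", "efficiency",
            "security", "style"]

def score_pair(chosen_scores, rejected_scores, weights=None):
    """Aggregate per-criterion scores into scalar rewards and report
    whether the 'chosen' completion is preferred under the weighting."""
    weights = weights or {c: 1.0 for c in CRITERIA}
    total = sum(weights.values())

    def agg(scores):
        return sum(weights[c] * scores[c] for c in CRITERIA) / total

    r_chosen, r_rejected = agg(chosen_scores), agg(rejected_scores)
    return r_chosen, r_rejected, r_chosen > r_rejected

# Toy example: the chosen completion wins on correctness but loses on style.
chosen = {"functional_correctness": 0.9, "readability": 0.7,
          "efficiency": 0.6, "security": 0.8, "style": 0.5}
rejected = {"functional_correctness": 0.4, "readability": 0.8,
            "efficiency": 0.6, "security": 0.7, "style": 0.6}

r_c, r_r, preferred = score_pair(chosen, rejected)
```

Re-weighting the criteria dict is what makes the scoring "flexible": the same per-criterion outputs can yield different preference decisions depending on which dimensions a user prioritizes.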

Why it matters

Current code reward models primarily focus on functional correctness, limiting their utility. This work addresses that gap by enabling flexible, multi-criteria scoring for code generation. This advancement is crucial for developing more sophisticated and human-aligned code LMs.

Original Abstract

Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
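The abstract describes training reward models on preference pairs. The standard objective for this setup (used broadly in RM training, though the paper's exact recipe may differ) is the Bradley-Terry loss, which pushes the reward of the preferred completion above that of the rejected one:

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Standard Bradley-Terry preference objective:
    -log sigmoid(r_chosen - r_rejected).
    Minimized when the chosen completion's reward exceeds the rejected one's."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
```

At a reward gap of zero the loss is log 2; it shrinks toward zero as the model scores the chosen completion increasingly higher than the rejected one.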
