ArXiv TLDR

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

2604.08477

Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel

cs.AI, cs.LG

TLDR

SUPERNOVA is a data curation framework that uses RL with natural instructions to significantly improve LLM general reasoning by adapting existing instruction-tuning datasets.

Key contributions

  • Introduces SUPERNOVA, a data curation framework for RLVR to boost LLM general reasoning.
  • Adapts expert-annotated instruction-tuning datasets to create verifiable training data.
  • Analyzes data design choices (task selection, mixing, synthetic interventions) via 100+ RL experiments.
  • Achieves up to 52.8% relative improvement on BBEH, outperforming strong baselines like Qwen3.5.

Why it matters

LLMs struggle with general reasoning due to a lack of high-quality, verifiable RL training data. SUPERNOVA addresses this by providing a principled framework for curating human-annotated resources, extending RLVR's benefits to diverse reasoning skills. This offers practical insights for future LLM development.

Original Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground-truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.
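The abstract's key insight is that instruction-tuning records with expert-annotated ground truth can be repurposed as verifiable RL training data. The sketch below is a hypothetical illustration of that idea, not the paper's actual pipeline: the record fields (`instruction`, `output`), the `adapt` helper, and the exact-match reward are all assumptions standing in for whatever verification SUPERNOVA actually uses.

```python
# Hypothetical sketch (not the paper's pipeline): converting an
# instruction-tuning record with a gold answer into an RLVR item
# scored by a binary verifiable reward.
from dataclasses import dataclass


@dataclass
class RLVRItem:
    prompt: str  # instruction shown to the policy model
    gold: str    # expert-annotated ground-truth answer


def verifiable_reward(item: RLVRItem, model_answer: str) -> float:
    """Binary reward: 1.0 iff the model's answer matches the gold
    annotation after light normalization (a common RLVR setup)."""
    norm = lambda s: " ".join(s.strip().lower().split())
    return 1.0 if norm(model_answer) == norm(item.gold) else 0.0


def adapt(record: dict) -> RLVRItem:
    """Map an instruction-tuning record to a verifiable RL item.
    Field names here are assumed, not from the paper."""
    return RLVRItem(prompt=record["instruction"], gold=record["output"])


item = adapt({
    "instruction": "Which event came first: the moon landing "
                   "or the fall of the Berlin Wall?",
    "output": "the moon landing",
})
print(verifiable_reward(item, "The moon landing"))  # 1.0
```

The point of the binary reward is that correctness is checked mechanically against the human annotation, so no learned reward model is needed; the paper's contribution is in choosing and mixing which source tasks feed this conversion.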
