Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL
Zhaofeng Wu, Shiqi Wang, Boya Peng, Anuj Goyal, Melanie Kambadur + 3 more
TLDR
Parallel-SFT improves zero-shot cross-programming-language transfer for code RL by using parallel programs in SFT, leading to better generalization to unseen PLs.
Key contributions
- Finds that standard RL training for code generation in a single source PL fails to improve, and can even degrade, zero-shot performance in other PLs.
- Introduces Parallel-SFT, an SFT strategy that mixes "parallel programs" (functionally equivalent code written in multiple PLs) into the data used to initialize RL (see the sketch after this list).
- Demonstrates Parallel-SFT improves generalization to unseen programming languages after subsequent RL.
- Shows that Parallel-SFT yields a more functionality-centric latent space, where equivalent programs across PLs cluster together, which the authors hypothesize aids cross-language transfer.
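To make the data-mixture idea concrete, here is a minimal sketch of how parallel programs could be folded into an SFT corpus. The file layout, field names, and mixing ratio below are hypothetical illustrations; the paper does not publish its exact pipeline.

```python
# Sketch: assemble an SFT mixture that adds "parallel programs" (functionally
# equivalent solutions in several PLs) to a standard SFT corpus.
# All paths, field names, and the mixing ratio are hypothetical.
import json
import random


def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]


def build_parallel_sft_mixture(standard_path, parallel_path,
                               parallel_fraction=0.3, seed=0):
    """Mix standard SFT examples with parallel-program examples.

    Each parallel record is assumed to hold one problem statement and a dict
    mapping PL name -> equivalent implementation; it is expanded into one
    (prompt, completion) pair per PL, so the model sees the same functionality
    expressed in every language.
    """
    standard = load_jsonl(standard_path)      # {"prompt": ..., "completion": ...}
    parallel_raw = load_jsonl(parallel_path)  # {"problem": ..., "solutions": {"python": ..., "ruby": ...}}

    parallel = []
    for rec in parallel_raw:
        for lang, code in rec["solutions"].items():
            parallel.append({
                "prompt": f"Solve the following problem in {lang}:\n{rec['problem']}",
                "completion": code,
            })

    rng = random.Random(seed)
    n_parallel = min(int(parallel_fraction * len(standard)), len(parallel))
    mixture = standard + rng.sample(parallel, n_parallel)
    rng.shuffle(mixture)
    return mixture
```

The resulting mixture would then be used for ordinary SFT before RL; the key design choice is simply that every sampled problem appears in several PLs rather than one.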
Why it matters
This paper addresses a key limitation of large language models in code generation for low-resource programming languages. By proposing Parallel-SFT, it offers a novel approach to enable effective zero-shot transfer of coding skills across different languages. This could significantly broaden the applicability of code-generating LLMs.
Original Abstract
Modern language models demonstrate impressive coding capabilities in common programming languages (PLs), such as C++ and Python, but their performance in lower-resource PLs is often limited by training data availability. In principle, however, most programming skills are universal across PLs, so the capability acquired in one PL should transfer to others. In this work, we propose the task of zero-shot cross-programming-language transfer for code RL. We find that, for Llama-3.1, RL training for code generation in a source PL fails to improve, and sometimes even degrades, the performance on other target PLs. To address this, we hypothesize that effective RL transfer requires a generalizable SFT initialization before RL. We thus propose **Parallel-SFT**, an SFT strategy that incorporates "parallel programs" -- functionally equivalent code implemented in multiple PLs -- into the data mixture. We demonstrate that this improves transferability: when we subsequently perform RL on our Parallel-SFT model, we observe better generalization to unseen PLs. Analysis of the model internal representations reveals that Parallel-SFT leads to a more functionality-centric latent space, where equivalent programs across PLs are more tightly clustered, which we hypothesize to contribute to the improved transferability.
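The representation analysis can be illustrated with a small probe: embed functionally equivalent programs written in different PLs and check how tightly they cluster. The checkpoint name, mean pooling, and cosine similarity below are illustrative assumptions in the spirit of the paper's analysis, not its exact protocol.

```python
# Sketch: probe whether functionally equivalent programs in different PLs
# land close together in a model's latent space. Model choice, pooling, and
# similarity metric are hypothetical, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "meta-llama/Llama-3.1-8B"  # hypothetical choice for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto")
model.eval()


@torch.no_grad()
def embed(code: str) -> torch.Tensor:
    """Mean-pool the final hidden states of a program as its representation."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True).to(model.device)
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0).float()


def cross_pl_similarity(parallel_group: dict) -> float:
    """Average pairwise cosine similarity across PL implementations of one problem."""
    vecs = [embed(code) for code in parallel_group.values()]
    sims = [
        torch.nn.functional.cosine_similarity(vecs[i], vecs[j], dim=0).item()
        for i in range(len(vecs))
        for j in range(i + 1, len(vecs))
    ]
    return sum(sims) / len(sims)


# Example: the same "sum of a list" function in two PLs.
group = {
    "python": "def total(xs):\n    return sum(xs)",
    "javascript": "function total(xs) { return xs.reduce((a, b) => a + b, 0); }",
}
print(cross_pl_similarity(group))
```

Under the paper's hypothesis, a Parallel-SFT model would show higher within-group similarity than a model fine-tuned on single-PL data, reflecting a more functionality-centric latent space.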