ArXiv TLDR

Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization

arXiv: 2604.13175

Aadyot Bhatnagar, Peter Mørch Groth, Ali Madani

cs.LG · cs.AI · q-bio.QM

TLDR

STOMP is a novel offline RL algorithm that uses smooth Tchebysheff scalarization to achieve Pareto-optimal multi-objective alignment, outperforming state-of-the-art baselines in eight of nine protein engineering settings.

Key contributions

  • Introduces STOMP, a novel offline RL algorithm for multi-objective alignment using smooth Tchebysheff scalarization.
  • Addresses limitations of linear scalarization by recovering non-convex Pareto fronts in multi-objective RL.
  • Extends direct preference optimization to multi-objective settings by standardizing individual rewards.
  • Achieves state-of-the-art performance on protein engineering tasks, outperforming baselines in 8 of 9 settings.
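The failure mode of linear scalarization mentioned above can be seen on a toy problem. The sketch below (not from the paper; the parametric front and weight grid are illustrative assumptions) builds a two-objective maximization problem whose Pareto front curves toward the origin: every linear weighting is maximized at one of the two endpoints, while the (exact) Tchebysheff scalarization recovers interior trade-off points as well.

```python
# Toy 2-objective maximization problem with a non-convex Pareto front:
# f1(t) = t, f2(t) = (1 - t)^2 for t in [0, 1]. Every point is
# Pareto-optimal, but the curve lies below the chord between its
# endpoints, so only the endpoints lie on the convex hull of the
# achievable set -- the region linear scalarization can reach.
ts = [i / 1000 for i in range(1001)]
front = [(t, (1 - t) ** 2) for t in ts]
ideal = (1.0, 1.0)  # per-objective maxima (the ideal point z*)

def linear_argmax(w):
    """Front point maximizing the linear scalarization w1*f1 + w2*f2."""
    return max(front, key=lambda f: w[0] * f[0] + w[1] * f[1])

def tcheb_argmin(w):
    """Front point minimizing the weighted Tchebysheff distance to z*."""
    return min(front, key=lambda f: max(w[0] * (ideal[0] - f[0]),
                                        w[1] * (ideal[1] - f[1])))

weights = [(i / 10, 1 - i / 10) for i in range(1, 10)]
lin_pts = {linear_argmax(w) for w in weights}
tch_pts = {tcheb_argmin(w) for w in weights}
print(sorted(lin_pts))  # only the two endpoints, (0, 1) and (1, 0)
print(len(tch_pts))     # several distinct interior trade-off points
```

Sweeping the weights, linear scalarization collapses onto the two extreme solutions, whereas Tchebysheff scalarization traces out the whole front; this is the gap STOMP's smooth Tchebysheff formulation is designed to close.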

Why it matters

This work provides a principled approach to multi-objective offline RL, crucial for aligning models like LLMs with complex, conflicting human preferences. By recovering non-convex Pareto fronts, STOMP enables more robust and effective optimization in real-world applications such as protein engineering and chatbot development.

Original Abstract

Large language models can be aligned with human preferences through offline reinforcement learning (RL) on small labeled datasets. While single-objective alignment is well-studied, many real-world applications demand the simultaneous optimization of multiple conflicting rewards, e.g. optimizing both catalytic activity and specificity in protein engineering, or helpfulness and harmlessness for chatbots. Prior work has largely relied on linear reward scalarization, but this approach provably fails to recover non-convex regions of the Pareto front. In this paper, instead of scalarizing the rewards directly, we frame multi-objective RL itself as an optimization problem to be scalarized via smooth Tchebysheff scalarization, a recent technique that overcomes the shortcomings of linear scalarization. We use this formulation to derive Smooth Tchebysheff Optimization of Multi-Objective Preferences (STOMP), a novel offline RL algorithm that extends direct preference optimization to the multi-objective setting in a principled way by standardizing the individual rewards based on their observed distributions. We empirically validate STOMP on a range of protein engineering tasks by aligning three autoregressive protein language models on three laboratory datasets of protein fitness. Compared to state-of-the-art baselines, STOMP achieves the highest hypervolumes in eight of nine settings according to both offline off-policy and generative evaluations. We thus demonstrate that STOMP is a powerful, robust multi-objective alignment algorithm that can meaningfully improve post-trained models for multi-attribute protein optimization and beyond.
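The abstract's key ingredient, smooth Tchebysheff scalarization, replaces the non-differentiable max over weighted objective shortfalls with a temperature-scaled log-sum-exp. A minimal sketch, assuming the standard log-sum-exp smoothing with temperature `mu` (the exact form STOMP derives its loss from is in the paper; the residual values here are made up for illustration):

```python
import math

def tchebysheff(residuals):
    """Exact Tchebysheff scalarization: the worst weighted shortfall
    w_i * (z_i* - f_i(x)) across objectives. Non-smooth at ties."""
    return max(residuals)

def smooth_tchebysheff(residuals, mu=0.1):
    """Smooth upper bound on max() via mu-scaled log-sum-exp.
    Differentiable everywhere; converges to max() as mu -> 0."""
    m = max(residuals)  # subtract the max for numerical stability
    return m + mu * math.log(sum(math.exp((r - m) / mu) for r in residuals))

# Hypothetical weighted shortfalls for one candidate solution
res = [0.30, 0.28, 0.05]
print(tchebysheff(res))         # 0.3
print(smooth_tchebysheff(res))  # slightly above 0.3, smooth in res
```

The smooth version always upper-bounds the exact max by at most `mu * log(n)` for `n` objectives, so small `mu` trades gradient smoothness against tightness.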
