arXiv TLDR

From Imitation to Discrimination: Progressive Curriculum Learning for Robust Web Navigation

2604.12666

Chuang Peng, Wei Zhang, Renshuai Tao, Xinhao Zhang, Jian Yang

cs.LG · cs.CL · cs.HC

TLDR

This paper introduces the 590k-instance Triton dataset and a progressive training curriculum (SFT → ORPO → GRPO) for robust text-based web navigation, achieving a state-of-the-art Step Success Rate on Mind2Web and surpassing much larger proprietary models.

Key contributions

  • Introduces Triton, a 590k-instance dataset built with hard negative mining and dual-agent consensus for diverse web tasks.
  • Develops a progressive training curriculum with Triton-SFT, Triton-ORPO, and Triton-GRPO models.
  • Utilizes Odds Ratio Preference Optimization (ORPO) for discrimination and Group Relative Policy Optimization (GRPO) for long-horizon consistency.
  • Achieves a state-of-the-art 58.7% Step Success Rate on Mind2Web, outperforming GPT-4.5 (42.4%) and Claude-4.5 (41.4%) by over 16 percentage points.
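For intuition on the second training stage, ORPO augments the standard supervised loss with a log-odds-ratio penalty that pushes down plausible-but-wrong completions (exactly the "hard negative" distractors Triton mines). The sketch below is a minimal, self-contained version of that objective, not the paper's implementation; the weight `lam` and the use of average per-token log-probabilities are assumptions for illustration.

```python
import math

def orpo_loss(logp_chosen: float, logp_rejected: float, lam: float = 0.1) -> float:
    """ORPO-style objective sketch: supervised NLL on the chosen action
    plus a log-odds-ratio term that separates chosen from rejected.
    Inputs are average per-token log-probabilities under the policy."""

    def log_odds(logp: float) -> float:
        # log(p / (1 - p)), computed stably from log p via log1p
        return logp - math.log1p(-math.exp(logp))

    # Log odds ratio between the chosen and rejected sequences
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # Maximize log-sigmoid of the ratio, i.e. minimize its negation
    l_or = -math.log(1.0 / (1.0 + math.exp(-ratio)))
    l_sft = -logp_chosen  # standard supervised fine-tuning term
    return l_sft + lam * l_or
```

Note how the penalty shrinks as the rejected element (e.g. a topologically similar distractor node) becomes less likely than the correct one, which is the discrimination behavior the curriculum targets.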

Why it matters

This work significantly advances text-based web navigation by tackling issues of discrimination and generalization. It demonstrates that a specialized data curriculum, combined with progressive optimization techniques, can surpass the performance of much larger general-purpose models like GPT-4.5, highlighting a path for more efficient and robust web agents.

Original Abstract

Text-based web agents offer computational efficiency for autonomous web navigation, yet developing robust agents remains challenging due to the noisy and heterogeneous nature of real-world HTML. Standard Supervised Fine-Tuning (SFT) approaches fail in two critical dimensions: they lack discrimination capabilities to reject plausible but incorrect elements in densely populated pages, and exhibit limited generalization to unseen website layouts. To address these challenges, we introduce the Triton dataset (590k instances) and a progressive training curriculum. Triton is constructed via Structural-Semantic Hard Negative Mining, which explicitly mines topologically similar distractors, and a Dual-Agent Consensus pipeline that synthesizes diverse cross-domain tasks with strict verification. Building upon this foundation, our progressive curriculum produces three models: Triton-SFT-32B for basic imitation, Triton-ORPO-32B for robust discrimination via Odds Ratio Preference Optimization, and Triton-GRPO-32B for long-horizon consistency through Group Relative Policy Optimization. Empirical evaluation on Mind2Web demonstrates that Triton-GRPO-32B achieves state-of-the-art performance among open-source models with 58.7% Step Success Rate, surpassing GPT-4.5 (42.4%) and Claude-4.5 (41.4%) by over 16%, validating that specialized data curriculum outweighs raw parameter scale for web navigation.
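The final GRPO stage mentioned in the abstract replaces a learned value critic with group-relative reward normalization: each rollout's reward is scored against the mean and standard deviation of its own sampled group. A minimal sketch of that advantage computation, with function and parameter names chosen here for illustration (not taken from the paper):

```python
def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage sketch: standardize each rollout's reward
    within its sampled group, so no separate value network is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # Rollouts above the group mean get positive advantage, below get negative
    return [(r - mean) / (std + eps) for r in rewards]
```

For a long-horizon web task, the group would be several sampled action trajectories for the same instruction; trajectories that complete more steps correctly receive positive advantage relative to their peers.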
