ArXiv TLDR

From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python

arXiv:2604.11518

Jinhua Wang, Biswa Sengupta

cs.SE, cs.AI

TLDR

This paper details an LLM-assisted, benchmark-driven migration of a production AI agent (Codex CLI) from Rust to Python, achieving near-parity on agentic benchmarks while growing into a feature superset.

Key contributions

  • Python port of Codex CLI achieves near-parity on agentic tasks, resolving 59/80 SWE-bench Verified tasks versus Rust's 56/80.
  • Benchmark-driven debugging identifies complex issues such as API protocol mismatches and silent WebSocket failures that static testing misses.
  • The architecture enables continuous upstream synchronization using an LLM-assisted diff-translate-test loop.
  • The Python port evolved into a superset with 30 new features, while maintaining a strict parity mode.
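
The continuous-synchronization idea above can be sketched as a small control loop. This is an illustrative outline only, not the paper's implementation: the callables (`get_diff`, `translate`, `apply_patch`, `run_benchmark`) are hypothetical stand-ins for the git diff, LLM translation, patching, and benchmark steps, and the accept-if-score-holds policy is an assumption.

```python
def diff_translate_test(get_diff, translate, apply_patch, run_benchmark,
                        max_retries=3):
    """One sync cycle of an LLM-assisted diff-translate-test loop (sketch).

    get_diff()      -> str: upstream changes since last sync ("" if none)
    translate(diff) -> str: LLM-produced patch for the target codebase
    apply_patch(p)  -> None: apply the translated patch to the port
    run_benchmark() -> float: benchmark pass rate, used as objective function
    """
    diff = get_diff()
    if not diff:
        return True  # already in sync with upstream
    baseline = run_benchmark()
    for _ in range(max_retries):
        # Translate the upstream diff and apply it to the Python port.
        apply_patch(translate(diff))
        # Accept the sync only if the benchmark objective is preserved.
        if run_benchmark() >= baseline:
            return True
    return False  # translation repeatedly regressed the benchmark
```

Treating the benchmark score as the acceptance criterion, rather than static tests alone, is what makes the loop "benchmark-driven": a translated diff that compiles but silently regresses agent behavior is rejected.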

Why it matters

This paper offers a principled methodology for LLM-assisted, benchmark-driven cross-language migration of rapidly evolving software, particularly AI agents. It demonstrates that for latency-dominated agents, Python's expressiveness can yield a 15.9x code reduction with negligible performance cost, while benchmarks guide the port's evolution from parity into a feature-rich platform.

Original Abstract

Cross-language migration of large software systems is a persistent engineering challenge, particularly when the source codebase evolves rapidly. We present a methodology for LLM-assisted continuous code translation in which a large language model translates a production Rust codebase (648K LOC, 65 crates) into Python (41K LOC, 28 modules), with public agent benchmarks as the objective function driving iterative refinement. Our subject system is Codex CLI, a production AI coding agent. We demonstrate that: (1) the Python port resolves 59/80 SWE-bench Verified tasks (73.8%) versus Rust's 56/80 (70.0%), and achieves 42.5% on Terminal-Bench versus Rust's 47.5%, confirming near-parity on real-world agentic tasks; (2) benchmark-driven debugging, revealing API protocol mismatches, environment pollution, a silent WebSocket failure mode, and an API 400 crash, is more effective than static testing alone; (3) the architecture supports continuous upstream synchronisation via an LLM-assisted diff-translate-test loop; and (4) the Python port has evolved into a capability superset with 30 feature-flagged extensions (multi-agent orchestration, semantic memory, guardian safety, cost tracking) absent from Rust, while preserving strict parity mode for comparison. Our evaluation shows that for LLM-based agents where API latency dominates, Python's expressiveness yields a 15.9x code reduction with negligible performance cost, while the benchmark-as-objective-function methodology provides a principled framework for growing a cross-language port from parity into an extended platform.
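
The abstract's "feature-flagged extensions with a strict parity mode" can be pictured as a gating layer like the one below. This is a minimal sketch under assumed names: the flag fields and the rule that parity mode disables every extension are illustrative, not the paper's actual configuration surface.

```python
from dataclasses import dataclass


@dataclass
class FeatureFlags:
    """Gate extensions absent from the Rust original (illustrative).

    In strict parity mode every extension is forced off, so benchmark
    runs compare the port like-for-like against upstream Rust.
    """
    parity_mode: bool = False
    multi_agent_orchestration: bool = False
    semantic_memory: bool = False
    cost_tracking: bool = False

    def enabled(self, flag: str) -> bool:
        if self.parity_mode:
            return False  # strict parity: no extensions, regardless of flags
        return getattr(self, flag)
```

Keeping parity mode as a single switch over all 30 extensions is what lets the same codebase serve both as a faithful comparison target and as the extended platform.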
