ArXiv TLDR

Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads

arXiv:2604.12301

Justice Owusu Agyemang, Jerry John Kponyo, Elliot Amponsah, Godfred Manu Addo Boakye, Kwame Opuni-Boachie Obour Agyekum

cs.DC · cs.AI · cs.SE

TLDR

This study measures seven tactics that use a small local model as a triage layer in front of a frontier cloud model to cut cloud LLM token usage on coding-agent workloads; local routing plus prompt compression alone saves 45-79% of cloud tokens on edit- and explanation-heavy workloads.

Key contributions

  • Systematically measures seven tactics for reducing cloud LLM token usage with a local model triage layer.
  • Implements tactics in an open-source shim supporting Ollama and OpenAI-compatible endpoints.
  • Finds that local routing (T1) plus prompt compression (T2) saves 45-79% of cloud tokens on edit-heavy and explanation-heavy workloads (see the sketch after this list).
  • Shows that the optimal tactic subset is workload-dependent, the paper's most actionable finding for practitioners deploying coding agents.
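
The shim itself is open source; as a rough illustration of how T1 and T2 might compose, here is a minimal Python sketch against Ollama's OpenAI-compatible endpoint. The model names, the length/keyword routing heuristic, and the whitespace-based compression are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of T1 (local routing) + T2 (prompt compression).
# Thresholds, keywords, and model names are assumptions for illustration.
import re
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # Ollama's OpenAI-compatible API
cloud = OpenAI()  # reads OPENAI_API_KEY; any OpenAI-compatible endpoint works

def compress(prompt: str) -> str:
    """T2: crude prompt compression -- collapse whitespace, drop blank lines."""
    lines = [re.sub(r"\s+", " ", ln).strip() for ln in prompt.splitlines()]
    return "\n".join(ln for ln in lines if ln)

def is_simple(prompt: str) -> bool:
    """T1: toy triage heuristic -- short prompts with no 'hard' keywords stay local."""
    hard = ("refactor", "architecture", "prove", "concurrency")
    return len(prompt) < 800 and not any(k in prompt.lower() for k in hard)

def complete(prompt: str) -> str:
    if is_simple(prompt):
        resp = local.chat.completions.create(
            model="qwen2.5-coder:7b",  # any model served by Ollama
            messages=[{"role": "user", "content": prompt}],
        )
    else:
        resp = cloud.chat.completions.create(
            model="gpt-4o",  # any frontier model behind an OpenAI-compatible API
            messages=[{"role": "user", "content": compress(prompt)}],
        )
    return resp.choices[0].message.content
```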

Why it matters

This paper gives developers concrete, measured strategies for cutting cloud LLM costs in coding agents. By showing that token-saving tactics are workload-dependent, it guides practitioners toward the tactic subset that fits their specific workload rather than a one-size-fits-all configuration.

Original Abstract

We present a systematic measurement study of seven tactics for reducing cloud LLM token usage when a small local model can act as a triage layer in front of a frontier cloud model. The tactics are: (1) local routing, (2) prompt compression, (3) semantic caching, (4) local drafting with cloud review, (5) minimal-diff edits, (6) structured intent extraction, and (7) batching with vendor prompt caching. We implement all seven in an open-source shim that speaks both MCP and the OpenAI-compatible HTTP surface, supporting any local model via Ollama and any cloud model via an OpenAI-compatible endpoint. We evaluate each tactic individually, in pairs, and in a greedy-additive subset across four coding-agent workload classes (edit-heavy, explanation-heavy, general chat, RAG-heavy). We measure tokens saved, dollar cost, latency, and routing accuracy. Our headline finding is that T1 (local routing) combined with T2 (prompt compression) achieves 45-79% cloud token savings on edit-heavy and explanation-heavy workloads, while on RAG-heavy workloads the full tactic set including T4 (draft-review) achieves 51% savings. We observe that the optimal tactic subset is workload-dependent, which we believe is the most actionable finding for practitioners deploying coding agents today.
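
The abstract's "greedy-additive subset" can be read as a standard greedy search: start from no tactics and repeatedly add whichever remaining tactic yields the largest marginal token saving on the target workload, stopping when nothing improves. A minimal sketch, assuming a hypothetical `measure_savings` function that runs a workload through the shim with a given tactic subset and returns cloud tokens saved:

```python
# Greedy-additive search over tactic subsets T1..T7, as described in the abstract.
# `measure_savings` is a hypothetical stand-in for the paper's evaluation harness.
from typing import Callable, FrozenSet

TACTICS = range(1, 8)  # T1..T7

def greedy_subset(measure_savings: Callable[[FrozenSet[int]], float]) -> list[int]:
    chosen: list[int] = []
    best = measure_savings(frozenset())  # baseline: no tactics enabled
    while True:
        candidates = [t for t in TACTICS if t not in chosen]
        if not candidates:
            break
        # Score each remaining tactic added on top of the current subset.
        scored = [(measure_savings(frozenset(chosen + [t])), t) for t in candidates]
        top_score, top_tactic = max(scored)
        if top_score <= best:  # stop when no tactic improves savings
            break
        best, chosen = top_score, chosen + [top_tactic]
    return chosen
```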
