ArXiv TLDR

Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

arXiv:2605.07985

Joon Ha Kim, Geon-Woo Kim, Anoop Rachakonda, Daehyeok Kim

cs.DC, cs.AI

TLDR

Dooly introduces a configuration-agnostic, redundancy-aware profiling method for LLM inference simulation, cutting profiling GPU-hours by 56.4% while preserving simulation accuracy.

Key contributions

  • Exploits LLM operation structure for configuration-agnostic, redundancy-aware profiling.
  • Uses taint propagation to label input dimensions and selectively profile only new operations.
  • Isolates stateful operations by reusing serving engine code, eliminating manual instrumentation.
  • Reduces profiling GPU-hours by 56.4% while keeping simulation accuracy within 5% MAPE for TTFT and 8% for TPOT.
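The redundancy-aware idea in the contributions above can be sketched in a few lines: key a latency database on the dimensions fixed by the model configuration, sweep the request-dependent dimensions once, and let every later model that shares the same configuration signature hit the cache instead of the GPU. All names here (`profile_op`, the dimension labels) are illustrative, not Dooly's actual code, and the "measurement" is a stand-in formula.

```python
from itertools import product

def profile_op(op_name, config_dims, request_dims):
    """Stand-in for a real GPU measurement; returns a fake latency in ms."""
    return 0.001 * sum(config_dims.values()) * sum(request_dims.values())

# (op_name, frozen config dims) -> {request-dim values: measured latency}
latency_db = {}

def profile_if_missing(op_name, config_dims, request_sweep):
    """Profile an operation only if its configuration signature is new.

    config_dims: dimensions fixed by the model configuration (e.g. head size),
    which recur across models. request_sweep: value ranges for the
    request-dependent dimensions (e.g. batch size, sequence length), swept once
    and reused by every model sharing the same configuration signature.
    """
    key = (op_name, frozenset(config_dims.items()))
    if key in latency_db:  # redundancy check: skip already-profiled operations
        return latency_db[key]
    table = {}
    names = list(request_sweep)
    for values in product(*request_sweep.values()):
        request_dims = dict(zip(names, values))
        table[tuple(values)] = profile_op(op_name, config_dims, request_dims)
    latency_db[key] = table
    return table

# Two models that share head_size=128 reuse one sweep over request dimensions.
sweep = {"batch": [1, 8], "seq_len": [128, 2048]}
t1 = profile_if_missing("qkv_proj", {"head_size": 128, "num_heads": 32}, sweep)
t2 = profile_if_missing("qkv_proj", {"head_size": 128, "num_heads": 32}, sweep)
assert t1 is t2  # second model hits the database; no re-profiling
```

In the paper's setting, the origin of each dimension (configuration-fixed vs. request-dependent) is discovered automatically via taint propagation rather than hand-labeled as in this sketch.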

Why it matters

Optimizing LLM inference is crucial, but current profiling methods are slow and expensive. Dooly cuts profiling GPU-hours by 56.4% while maintaining high simulation accuracy. This enables faster, more cost-effective exploration of LLM inference configurations, accelerating both research and deployment.

Original Abstract

Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool, yet they hardcode their operation set to a specific configuration and re-profile every operation from scratch, making exploration prohibitively expensive. This cost stems from a missing structural understanding: every input dimension of each operation is fixed by the model configuration or determined by the incoming request. Many model-configuration values (e.g., head size, layer count) recur across models, so the same operation runs in many configurations; a single sweep over the request-dependent dimensions can serve them all. We present Dooly, which exploits this structure to achieve configuration-agnostic, redundancy-aware profiling. Dooly performs a single inference pass, labels each input dimension with its origin via taint propagation, and selectively profiles only operations absent from its latency database; stateful operations such as attention are isolated by reusing the serving engine's own initialization code, eliminating manual instrumentation. It builds latency regression models based on the database, which becomes a drop-in backend for existing simulators. Across two GPU platforms, three attention backends, and diverse model architectures, Dooly achieves simulation accuracy within 5% MAPE for TTFT and 8% for TPOT while reducing profiling GPU-hours by 56.4% across 12 models compared to the existing profiling approach.
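The abstract's last step, building latency regression models over the profiled database so the database "becomes a drop-in backend for existing simulators", can be illustrated with a least-squares fit. The linear-in-tokens model form and the sample numbers below are assumptions for illustration; the abstract does not specify Dooly's regression family.

```python
import numpy as np

# Hypothetical profiled samples for one operation: (num_tokens, latency_ms).
samples = np.array([[128, 0.40], [512, 1.35], [2048, 5.10], [8192, 20.3]])

# Least-squares fit of latency ~ a * num_tokens + b over the database.
A = np.vstack([samples[:, 0], np.ones(len(samples))]).T
a, b = np.linalg.lstsq(A, samples[:, 1], rcond=None)[0]

def predict_latency(num_tokens):
    """A simulator queries this model instead of re-profiling the GPU."""
    return a * num_tokens + b
```

A simulator can then estimate end-to-end metrics such as TTFT and TPOT by summing predicted per-operation latencies for a given request, without touching the hardware again.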
