Feedback-Driven Execution for LLM-Based Binary Analysis

April 16, 2026

arXiv:2604.15136

cs.CR

TLDR

FORGE uses feedback-driven execution and a Dynamic Forest of Agents (FoA) to improve LLM-based binary analysis, finding 1,274 vulnerabilities across 591 firmware binaries at 72.3% precision.

Key contributions

Introduces FORGE, a feedback-driven system for LLM-based binary analysis using a reasoning-action-observation loop.
Employs a Dynamic Forest of Agents (FoA) for decomposed, parallel exploration and bounded per-agent context.
Evaluated on 3,457 real-world firmware binaries, identifying 1,274 vulnerabilities across 591 unique binaries at 72.3% precision.
Covers a broader range of vulnerability types compared to existing one-pass LLM analysis approaches.
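
A reasoning-action-observation loop of the kind described above can be sketched as follows. This is a minimal illustration, not FORGE's implementation: the toy call graph, the `list_callees` tool, the `RISKY_CALLS` set, and the rule-based "reasoning" step that stands in for an LLM are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy "binary": function name -> functions it calls.
CALL_GRAPH: Dict[str, List[str]] = {
    "main": ["parse_request", "log_msg"],
    "parse_request": ["strcpy"],
    "log_msg": [],
}

# Hypothetical set of calls treated as risky sinks in this sketch.
RISKY_CALLS = {"strcpy"}


def list_callees(func: str) -> List[str]:
    """Toy analysis tool: return the callees of one function."""
    return CALL_GRAPH.get(func, [])


def run_loop(entry: str, max_steps: int = 10):
    """Interleave reasoning, tool calls, and observations until done."""
    frontier: List[str] = [entry]           # functions still worth exploring
    seen = set()
    history: List[Tuple[str, List[str]]] = []   # (action target, observation)
    findings: List[Tuple[str, str]] = []        # (caller, risky callee)

    for _ in range(max_steps):
        # Reason: choose the next target based on what has been seen so far.
        # (A real system would ask an LLM; here it is a fixed rule.)
        target = next((f for f in frontier if f not in seen), None)
        if target is None:
            break
        seen.add(target)

        # Act: invoke a tool on the chosen target.
        callees = list_callees(target)

        # Observe: record the result and let it steer later steps.
        history.append((target, callees))
        for callee in callees:
            if callee in RISKY_CALLS:
                findings.append((target, callee))
            else:
                frontier.append(callee)

    return findings, history


findings, history = run_loop("main")
print(findings)
```

The point of the sketch is that each observation changes the frontier, so later steps depend on earlier results rather than on a fixed, precomputed program representation.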

Why it matters

Existing LLM-based binary analysis typically runs in one pass over a fixed program representation, which makes long-horizon, multi-path tasks hard to sustain. FORGE's feedback loop lets exploration adapt to intermediate results, and its decomposed agents keep per-agent context bounded while exploring in parallel. The reported results suggest that this structure makes LLMs more useful for complex program understanding at scale.

Original Abstract

Binary analysis increasingly relies on large language models (LLMs) to perform semantic reasoning over complex program behaviors. However, existing approaches largely adopt a one-pass execution paradigm, where reasoning operates over a fixed program representation constructed by static analysis tools. This formulation limits the ability to adapt exploration based on intermediate results and makes it difficult to sustain long-horizon, multi-path analysis under constrained context. We present FORGE, a system that rethinks LLM-based analysis as a feedback-driven execution process. FORGE interleaves reasoning and tool interaction through a reasoning-action-observation loop, enabling incremental exploration and evidence construction. To address the instability of long-horizon reasoning, we introduce a Dynamic Forest of Agents (FoA), a decomposed execution model that dynamically coordinates parallel exploration while bounding per-agent context. We evaluate FORGE on 3,457 real-world firmware binaries. FORGE identifies 1,274 vulnerabilities across 591 unique binaries, achieving 72.3% precision while covering a broader range of vulnerability types than prior approaches. These results demonstrate that structuring LLM-based analysis as a decomposed, feedback-driven execution system enables both scalable reasoning and high-quality outcomes in long-horizon tasks.
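
The idea of bounding per-agent context while exploring in parallel can be illustrated with a much-simplified sketch. The abstract does not describe FoA's actual coordination policy, so everything here is an assumption: the `Agent` class, the `MAX_CONTEXT` budget, the toy acyclic call graph, and the rule that an agent hands leftover work to fresh child agents once its budget is used up.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical toy call graph (a tree, so exploration terminates).
CALL_GRAPH: Dict[str, List[str]] = {
    "main": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
}

MAX_CONTEXT = 2  # hypothetical per-agent budget, counted in visited functions


@dataclass
class Agent:
    root: str
    context: List[str] = field(default_factory=list)
    children: List["Agent"] = field(default_factory=list)


def run_agent(agent: Agent) -> List[str]:
    """Explore from agent.root, delegating to children when context is full."""
    visited: List[str] = []
    pending: List[str] = [agent.root]

    while pending:
        func = pending.pop(0)
        if len(agent.context) >= MAX_CONTEXT:
            # Budget exhausted: spawn a child agent with an empty context.
            agent.children.append(Agent(root=func))
            continue
        agent.context.append(func)
        visited.append(func)
        pending.extend(CALL_GRAPH.get(func, []))

    # Child agents run in parallel; map() returns results in child order.
    with ThreadPoolExecutor() as pool:
        for sub_visited in pool.map(run_agent, agent.children):
            visited.extend(sub_visited)
    return visited


root = Agent(root="main")
print(run_agent(root))
```

No single agent ever holds more than `MAX_CONTEXT` entries, yet the agents together cover the whole graph. That is the trade-off the abstract attributes to a decomposed execution model.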
