ArXiv TLDR

The Infinite Mutation Engine? Measuring Polymorphism in LLM-Generated Offensive Code

2605.03619

Gabriel Hortea, Juan Tapiador

cs.CR

TLDR

LLMs can generate highly polymorphic, behaviorally identical offensive code, posing a significant threat to signature-based malware detection.

Key contributions

  • Measured polymorphic capacity of Claude Opus 4.6 for generating diverse, behaviorally identical malware.
  • Developed a dual-agent, four-stage pipeline to generate, test, and refine data-exfiltration payloads.
  • Found LLMs produce high structural diversity but low semantic distance without explicit polymorphism requests.
  • Explicit prompting significantly amplifies structural diversity while preserving correctness, at only a minor cost increase.

Why it matters

This paper reveals that LLMs can cheaply produce highly polymorphic malware, making traditional signature-based detection increasingly ineffective. It underscores a critical emerging threat in cybersecurity, demanding new defensive strategies against AI-generated offensive tools.

Original Abstract

Malware authors have traditionally relied on polymorphic techniques to produce variants in the same malware family, complicating signature-based detection. Integrating generative AI into offensive toolchains enables attackers to synthesize structurally diverse payloads with identical behavior, raising the question of how much polymorphism LLMs provide. Recent work has assumed that LLMs can produce sufficiently polymorphic payloads, leaving unquantified the variation that emerges when an attacker repeatedly builds the same payload, or explicitly instructs the model to avoid prior implementations. In this work, we measure the polymorphic capacity of a commercial model (Claude Opus 4.6) as an automated malware generator. We build a dual-agent, four-stage pipeline that generates, tests, and refines a data-exfiltration payload comprising file traversal, encryption, exfiltration, and integration. We produce payloads in two settings: using prompts that specify only functional requirements, and using prompts that inject a structured history of prior outcomes to force divergence. We measure pairwise distances along structural (AST) and semantic (embedding) axes, finding that when polymorphism is not explicitly required, structural distances are high while semantic distances remain low; i.e., implementations diverge widely without changing high-level behavior. Explicit prompting substantially amplifies this structural diversity while preserving correctness, at the cost of roughly 5 times more tokens but only a small increase in LLM calls (from 4.2 to 4.5 per payload, with effective API costs of $0.41 and $0.73). These results show that a single commercial LLM can cheaply generate large populations of behaviorally equivalent yet structurally diverse payloads, facilitating the evasion of signature-based detection rules and similarity-based clustering.
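To make the "structural (AST) axis" concrete: one simple way to quantify structural distance between two behaviorally identical programs is to compare the distributions of their AST node types. This is an illustrative sketch only; the paper's exact AST metric is not specified in this summary, and the node-type cosine distance below is an assumption chosen for clarity. The example compares two harmless variants of the same function.

```python
import ast
import math
from collections import Counter

def ast_node_profile(source: str) -> Counter:
    """Count occurrences of each AST node type in a Python source string."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))

def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two sparse count vectors (0 = identical shape)."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Two behaviorally equivalent implementations of the same task:
variant_1 = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
variant_2 = "def total(xs):\n    return sum(xs)\n"

d = cosine_distance(ast_node_profile(variant_1), ast_node_profile(variant_2))
print(f"structural distance: {d:.3f}")  # value in [0, 1]; > 0 since the ASTs differ
```

The semantic axis would be measured analogously, but with embedding vectors of the source text (e.g. cosine distance between code embeddings) instead of AST node counts; high structural distance with low semantic distance is the paper's signature of polymorphism without behavioral change.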
