Mythos and the Unverified Cage: Z3-Based Pre-Deployment Verification for Frontier-Model Sandbox Infrastructure

April 22, 2026
arXiv:2604.20496

cs.CR, cs.AI

TLDR

COBALT uses the Z3 SMT solver to check C/C++ infrastructure code for arithmetic vulnerabilities (CWE-190/191/195) before deployment, with the goal of hardening the sandboxes that contain frontier AI models.

Key contributions

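COBALT: a Z3 SMT-based engine for pre-deployment detection of CWE-190/191/195 arithmetic vulnerability patterns in C/C++ infrastructure.
Validates COBALT on four production codebases (NASA cFE, wolfSSL, Eclipse Mosquitto, and NASA F Prime), reporting SAT verdicts with concrete witnesses and UNSAT guarantees under explicit safety bounds.
Proposes a four-layer containment framework for frontier AI models (COBALT, VERDICT, DIRECTIVE-4, SENTINEL) covering pre-deployment verification, pre-execution constraints, output control, and runtime monitoring.
Argues, under explicit assumptions, that the publicly reported Mythos escape class is consistent with a Z3-expressible CWE-190 formulation that pre-deployment formal analysis could have surfaced.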
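
For readers unfamiliar with the three weakness classes named above, the following is a minimal sketch of what each looks like in 32-bit C arithmetic, emulated in plain Python. It is not taken from the paper or from COBALT; the variable names (`count`, `elem_size`, `header_len`, `packet_len`, `length`) and the helpers `u32` and `s32` are hypothetical and chosen only for illustration.

```python
MASK32 = 0xFFFFFFFF


def u32(x: int) -> int:
    """Truncate a Python int to 32 bits, as C uint32_t arithmetic does."""
    return x & MASK32


def s32(x: int) -> int:
    """Reinterpret the low 32 bits of x as a two's-complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


# CWE-190 (integer overflow or wraparound): a size computation wraps,
# so the allocation is far smaller than the caller intended.
count, elem_size = 0x40000001, 4
alloc = u32(count * elem_size)          # 0x100000004 truncated to 4

# CWE-191 (integer underflow): subtracting a larger unsigned value
# wraps to a huge positive length.
header_len, packet_len = 8, 5
payload = u32(packet_len - header_len)  # -3 wraps to 0xFFFFFFFD

# CWE-195 (signed to unsigned conversion error): a negative length
# becomes the largest unsigned value when passed as a size.
length = -1
as_size = u32(length)                   # 0xFFFFFFFF

# The bit pattern is unchanged; only the interpretation differs.
assert s32(as_size) == length
```

In each case the arithmetic is well defined at the bit level, which is why such patterns can be stated as fixed-width bit-vector constraints for an SMT solver.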

Why it matters

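The Mythos sandbox escape points to a containment weakness in the infrastructure around frontier AI models, although the escape vector itself has not been publicly characterized. The paper argues that behavioral safeguards alone are insufficient and that the containment stack itself should be formally verified. It validates one piece of that approach, pre-deployment detection of arithmetic vulnerabilities, on production code; the broader four-layer framework is a proposal.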

Original Abstract

The April 2026 Claude Mythos sandbox escape exposed a critical weakness in frontier AI containment: the infrastructure surrounding advanced models remains susceptible to formally characterizable arithmetic vulnerabilities. Anthropic has not publicly characterized the escape vector; some secondary accounts hypothesize a CWE-190 arithmetic vulnerability in sandbox networking code. We treat this as unverified and analyze the vulnerability class rather than the specific escape. This paper presents COBALT, a Z3 SMT-based formal verification engine for identifying CWE-190/191/195 arithmetic vulnerability patterns in C/C++ infrastructure prior to deployment. We distinguish two classes of contribution. Validated: COBALT detects arithmetic vulnerability patterns in production codebases, producing SAT verdicts with concrete witnesses and UNSAT guarantees under explicit safety bounds. We demonstrate this on four production case studies: NASA cFE, wolfSSL, Eclipse Mosquitto, and NASA F Prime, with reproducible encodings, verified solver output, and acknowledged security outcomes. Proposed: a four-layer containment framework consisting of COBALT, VERDICT, DIRECTIVE-4, and SENTINEL, mapping pre-deployment verification, pre-execution constraints, output control, and runtime monitoring to the failure modes exposed by the Mythos incident. Under explicit assumptions, we further argue that the publicly reported Mythos escape class is consistent with a Z3-expressible CWE-190 arithmetic formulation and that pre-deployment formal analysis would have been capable of surfacing the relevant pattern. The broader claim is infrastructural: frontier-model safety cannot depend on behavioral safeguards alone; the containment stack itself must be subjected to formal verification.
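
The abstract's distinction between a SAT verdict with a concrete witness and an UNSAT guarantee under an explicit safety bound can be illustrated with a small sketch. This is an assumption-laden illustration, not COBALT's encoding: it replaces the Z3 bit-vector query with direct integer reasoning in plain Python so that it runs without dependencies, and the function name `find_overflow_witness` and its parameters are hypothetical.

```python
from typing import Optional

UINT32_MAX = 2**32 - 1


def find_overflow_witness(elem_size: int, max_count: int) -> Optional[int]:
    """Look for a count in [0, max_count] where count * elem_size
    exceeds UINT32_MAX, i.e. where a uint32_t multiplication would wrap.

    Returns the smallest such count (the "SAT" case, with a witness),
    or None if no count within the bound overflows (the "UNSAT" case).
    """
    if elem_size <= 0:
        return None
    # Smallest count whose product no longer fits in 32 bits.
    smallest = UINT32_MAX // elem_size + 1
    return smallest if smallest <= max_count else None


# Unbounded input: the overflow is reachable, and we get a concrete witness.
witness = find_overflow_witness(elem_size=16, max_count=UINT32_MAX)
assert witness == 0x10000000
assert (witness * 16) & UINT32_MAX == 0  # the wrapped product is 0

# With an explicit safety bound on count, no overflowing input exists.
assert find_overflow_witness(elem_size=16, max_count=1024) is None
```

The second call shows why the bound matters: the "no overflow" result holds only for inputs that respect `max_count`, which mirrors the abstract's phrase "UNSAT guarantees under explicit safety bounds". A solver-based tool would reach the same two answers by asking whether the negated safety property is satisfiable.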
