From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation
Guang Yang, Xing Hu, Xiang Chen, Xin Xi
TLDR
New research reveals that MLLMs often ignore the visual input when generating code from circuit diagrams, and proposes VeriGround to achieve genuine visual grounding.
Key contributions
- MLLMs exhibit "Mirage": they bypass the visual input when generating code from circuit diagrams, exploiting identifier semantics instead.
- Introduced C2VEVAL and a paired Normal/Anony protocol, showing that MLLMs' high Normal-mode accuracy is largely a Mirage: scores drop sharply once identifiers are anonymized.
- Proposed VeriGround (4B), trained with anonymization, refusal augmentation, and D-ORPO for genuine visual grounding.
- VeriGround achieves strong functional pass rates and high refusal on blank images, outperforming baselines under anonymized conditions.
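The Anony protocol anonymizes every user-defined identifier in the module header (and diagram) so that names like `full_adder` can no longer leak functionality. The paper does not publish its anonymization code, so the following is a minimal illustrative sketch of the idea for a Verilog module header, assuming a simple regex-based renamer and a small keyword list; the `m0, m1, ...` naming scheme is hypothetical.

```python
import re

# A small illustrative subset of Verilog keywords that must not be renamed.
VERILOG_KEYWORDS = {
    "module", "endmodule", "input", "output", "inout", "wire", "reg",
    "assign", "always", "begin", "end", "if", "else", "posedge", "negedge",
}

def anonymize_header(header: str) -> str:
    """Replace each user-defined identifier with an opaque name (m0, m1, ...),
    consistently across the header, so the model cannot retrieve a canonical
    RTL template from semantically meaningful names."""
    mapping: dict[str, str] = {}

    def rename(match: re.Match) -> str:
        name = match.group(0)
        if name in VERILOG_KEYWORDS:
            return name  # keep language keywords intact
        if name not in mapping:
            mapping[name] = f"m{len(mapping)}"
        return mapping[name]

    return re.sub(r"\b[A-Za-z_][A-Za-z0-9_$]*\b", rename, header)

print(anonymize_header(
    "module full_adder(input a, input b, input cin, output sum, output cout);"
))
# → module m0(input m1, input m2, input m3, output m4, output m5);
```

Under this transformation, a model that was pattern-matching on `full_adder` must actually read the diagram to recover the circuit's function, which is exactly what the Normal/Anony score gap measures.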
Why it matters
This paper uncovers a critical flaw ("Mirage") in MLLMs' ability to genuinely interpret visual inputs for code generation, especially in safety-critical domains like hardware design. By introducing VeriGround, it provides a robust solution for reliable multimodal code generation, enhancing trust and paving the way for safer AI-assisted engineering.
Original Abstract
Multimodal large language models (MLLMs) are increasingly used to translate visual artifacts into code, from UI mockups into HTML to scientific plots into Python scripts. A circuit diagram can be viewed as a visual domain-specific language for hardware: it encodes timing, topology, and bit-level semantics that are invisible to casual inspection yet safety-critical once fabricated in silicon. Translating such diagrams into register-transfer-level (RTL) code therefore represents an extreme reliability test for vision-to-code generation. We reveal a phenomenon we call Mirage: replacing a circuit diagram with a blank image leaves Pass@k unchanged or even higher, because models bypass the visual input and instead exploit identifier semantics in the module header to retrieve canonical RTL templates. This constitutes a new, highly covert class of defect in AI-assisted code generation that directly undermines MLLMs' trustworthiness. To quantify the effect, we construct C2VEVAL and evaluate eight MLLMs under a paired Normal/Anony protocol in which Anony mode anonymizes all identifiers in both the diagram and the module header; Anony-mode scores drop sharply across all models, confirming that high Normal-mode accuracy is largely a Mirage. We then propose VeriGround (4B), trained with identifier anonymization, refusal augmentation, and D-ORPO (Decision-Focused ORPO) preference alignment that up-weights pivotal generate-or-refuse tokens. VeriGround achieves Functional Pass@1 of 46.11%/42.51% (Normal/Anony) with a False Refusal Rate of only 1.20%/0.00%, while maintaining >92% Refusal Rate on blank images. With only 4B parameters, VeriGround performs on par with GPT-5.4 under Normal and significantly outperforms all baselines under Anony, confirming genuine visual grounding.