ArXiv TLDR

From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification

arXiv:2604.22601

Md Erfan, Md Kamal Hossain Chowdhury, Ahmed Ryan, Md Rayhanur Rahman

cs.SE, cs.AI

TLDR

This paper introduces NL2VC-60, a 60-problem dataset for formally verifying LLM-generated code with Dafny, and shows that tiered prompting with verifier feedback dramatically improves verification success.

Key contributions

  • Introduces NL2VC-60, a dataset of 60 complex algorithmic problems for formal verification.
  • Evaluates LLMs using tiered prompting: contextless, signature-guided, and self-healing with verifier feedback.
  • Integrates the uDebug platform to guard against vacuous verification (trivial specifications) by additionally validating functional correctness.
  • Achieves high verification success rates with signature-guided and self-healing prompts (Gemma 4-31B: 90.91%; GPT-OSS 120B: from 0% to 81.82%).
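The self-healing tier can be pictured as a simple repair loop. The sketch below is an assumption about the general shape of such a loop, not the paper's implementation: `verify` and `repair` are hypothetical stand-ins, where in practice `verify` would invoke the Dafny verifier on the candidate program and `repair` would re-prompt the LLM with the verifier's error output.

```python
from typing import Callable, Optional, Tuple

def self_healing_loop(
    candidate: str,
    verify: Callable[[str], Tuple[bool, str]],  # (verified?, verifier error log)
    repair: Callable[[str, str], str],          # re-prompt model with feedback
    max_rounds: int = 3,
) -> Optional[str]:
    """Feed verifier errors back to the model until the program
    verifies or the round budget is exhausted."""
    for _ in range(max_rounds):
        ok, errors = verify(candidate)
        if ok:
            return candidate          # verified program
        candidate = repair(candidate, errors)
    return None                       # failed to verify within budget
```

The loop terminates either with a verifier-accepted program or with `None` after `max_rounds` repair attempts, which is why the paper's self-healing results depend on the iteration budget.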

Why it matters

LLMs often produce incorrect or hallucinated code, which limits their use in critical systems. This work shows that formal verification, previously out of reach in this setting, is now achievable with open-weight LLMs, suggesting a practical path toward high-assurance software development with AI assistance.

Original Abstract

Large Language Models (LLMs) show promise in automated software engineering, yet their guarantee of correctness is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to synthesize implementation logic alongside formal specifications that are subsequently proven correct by a mathematical verifier. However, the transition from informal natural language to precise formal specification remains an arduous task. Our work addresses this by providing the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset: a collection of 60 complex algorithmic problems. We evaluate 11 randomly selected problem sets across seven open-weight LLMs using a tiered prompting strategy: contextless prompts, signature prompts providing structural anchors, and self-healing prompts utilizing iterative feedback from the Dafny verifier. To address vacuous verification, where models satisfy verifiers with trivial specifications, we integrate the uDebug platform to ensure functional validation. Our results show that while contextless prompting leads to near-universal failure, structural signatures and iterative self-healing facilitate a dramatic performance turnaround. Specifically, Gemma 4-31B achieved a 90.91% verification success rate, while GPT-OSS 120B rose from zero to 81.82% success with signature-guided feedback. These findings indicate that formal verification is now attainable for open-weight LLMs, which serve as effective apprentices for synthesizing complex annotations and facilitating high-assurance software development.

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.