Can Large Language Models Model Programs Formally?
Zhiyong Chen, Jialun Cao, Jiarong Wu, Chang Xu, Shing-Chi Cheung
TLDR
Model-Bench evaluates LLMs' ability to formally model Python programs for model checking, revealing current limitations and future research paths.
Key contributions
- Introduces Model-Bench, a benchmark and pipeline for evaluating LLMs' program modeling for model checking.
- Comprises 400 Python programs derived from the HumanEval, MBPP, and LiveCodeBench benchmarks.
- Converts Python programs into verification-ready model checking specifications checkable by its accompanying model checker (see the illustrative sketch after this list).
- Experiments reveal significant limitations in LLMs' current program modeling and suggest future research directions.
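To make the modeling task concrete, below is a minimal, hypothetical sketch of what "modeling a program for model checking" can look like in principle. It is not the paper's specification language or the accompanying model checker (Model-Bench defines its own format); it only illustrates the idea by abstracting a trivial Python loop into a finite transition system and exhaustively checking a safety invariant over its reachable states.

```python
# Hypothetical illustration only: abstract a tiny Python loop into a finite
# transition system and check an invariant by exhaustive state exploration.
# This is NOT Model-Bench's actual specification format or model checker.
from collections import deque

# Original program (informally): repeatedly move `x` toward `target` in steps
# of 1, never overshooting. Property: `x` always stays within [0, target].

def initial_states():
    # Abstract initial states: x starts anywhere in a small finite range.
    return [(x, 5) for x in range(0, 6)]  # state = (x, target)

def successors(state):
    # Transition relation derived from the program's loop body.
    x, target = state
    if x < target:
        yield (x + 1, target)
    elif x > target:
        yield (x - 1, target)
    else:
        yield (x, target)  # fixed point: stutter once the target is reached

def invariant(state):
    # Safety property to verify: x never leaves [0, target].
    x, target = state
    return 0 <= x <= target

def check():
    # Breadth-first exploration of the reachable state space.
    frontier = deque(initial_states())
    visited = set(frontier)
    while frontier:
        state = frontier.popleft()
        if not invariant(state):
            return f"violation at state {state}"
        for nxt in successors(state):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(nxt)
    return f"invariant holds over {len(visited)} reachable states"

if __name__ == "__main__":
    print(check())
```

In Model-Bench, the LLM's task is analogous but far harder: producing such a verification-ready model automatically from realistic benchmark programs, in the specification format its accompanying model checker can actually check.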
Why it matters
Formal verification is critical for software reliability, yet model checking has lagged behind theorem proving because programs are hard to model automatically. This paper fills that gap with a dedicated benchmark and empirical insights into how LLMs can advance the field, guiding future research.
Original Abstract
In the digital age, ensuring the correctness, safety, and reliability of software through formal verification is paramount, particularly as software increasingly underpins critical infrastructure. Formal verification, split into theorem proving and model checking, provides a feasible and reliable path. Unlike theorem proving, which yields notable advances, model checking has been less focused due to the difficulty of automatic program modeling. To fill this gap, we introduce Model-Bench, a benchmark and an accompanying pipeline for evaluating and improving LLMs' program modeling capability by modeling Python programs into verification-ready model checking specifications checkable by its accompanying model checker. Model-Bench comprises 400 Python programs derived from three well-known benchmarks (HumanEval, MBPP, and LiveCodeBench). Our extensive experiments reveal significant limitations in LLMs' program modeling and further provide inspiring directions.