ArXiv TLDR

Neurosymbolic Auditing of Natural-Language Software Requirements

2605.13817

Bethel Hall, William Eiers

cs.SE, cs.AI

TLDR

A neurosymbolic pipeline pairs LLMs with an SMT solver to audit natural-language software requirements, detecting ambiguity, inconsistency, vacuousness, and safety violations.

Key contributions

  • VERIMED, a neurosymbolic pipeline, translates natural-language requirements into formal logic.
  • It flags ambiguity via stochastic variation: a requirement whose independent LLM formalizations are SMT-inequivalent admits multiple plausible interpretations, and bidirectional SMT equivalence checking makes this disagreement solver-checkable.
  • SMT solver queries on the resulting specification expose inconsistency, vacuousness, and safety violations.
  • In counterexample-guided repair, concrete SMT counterexamples raise verified accuracy from 55.4% to 98.5% on a hemodialysis question-answering benchmark.
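The ambiguity check above can be illustrated with a toy propositional analogue. The paper uses an SMT solver for the equivalence test; here a brute-force truth-table check stands in for it, and the example requirement and variable names are hypothetical, not taken from the paper.

```python
from itertools import product

# Requirement: "If pressure is high, the pump stops and an alarm sounds."
# Two plausible formalizations an LLM might emit under stochastic sampling;
# the second misreads "and" as "or". Each maps (pressure_high, pump_on, alarm)
# to a truth value.
f1 = lambda p, pump, alarm: (not p) or ((not pump) and alarm)
f2 = lambda p, pump, alarm: (not p) or ((not pump) or alarm)

def equivalent(a, b):
    """Bidirectional equivalence: no assignment makes a and b disagree.
    (An SMT solver would instead check that (a != b) is unsatisfiable.)"""
    return all(a(*v) == b(*v) for v in product([False, True], repeat=3))

print(equivalent(f1, f2))  # False: the formalizations diverge -> ambiguity flag
```

Disagreement between independent formalizations is exactly the signal the pipeline treats as evidence that the original natural-language requirement is ambiguous.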

Why it matters

Ambiguous and inconsistent software requirements in safety-critical systems can propagate into dangerous failures. This paper offers a neurosymbolic method to audit such requirements automatically: by pairing LLMs with an SMT solver, it produces more trustworthy formalizations and supports safer development of critical software.

Original Abstract

Natural-language software requirements are often ambiguous, inconsistent, and underspecified; in safety-critical domains, these defects propagate into formal models that verify the wrong specification and into implementations that ship unsafe behavior. We show that large language models, equipped with an SMT solver, can audit such requirements: translating them into formal logic, detecting ambiguity through stochastic variation in the generated formalization, and exposing inconsistency, vacuousness, and safety violations through solver queries on the resulting specification. We present VERIMED, a neurosymbolic pipeline that operationalizes this idea for medical-device software requirements, and report two findings. First, stochastic variation across independent formalizations is a signal of ambiguity: requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations, and bidirectional SMT equivalence checking turns this disagreement into a solver-checkable test. Second, the usefulness of symbolic feedback depends on its granularity: in counterexample-guided repair on a hemodialysis question-answering benchmark, concrete SMT counterexamples raise verified accuracy from 55.4% to 98.5%. Over an extensive experimental evaluation on open-source hemodialysis safety requirements, we show that the LLM-based approach in VERIMED successfully reduces ambiguity-sensitive requirements and enables rigorous auditing of software requirements through SMT-based queries.
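The counterexample-guided repair loop described in the abstract can be sketched in miniature: when a candidate formalization permits a state that violates a safety property, the concrete falsifying assignment (which an SMT solver would return as a model) is handed back as repair feedback. All names and the safety property below are illustrative assumptions, not the paper's benchmark.

```python
from itertools import product

VARS = ("pressure_high", "pump_on", "alarm")

# Safety property: high pressure must never coexist with a running pump.
safe = lambda p, pump, alarm: not (p and pump)

# Candidate formalization under audit (buggy: it never forces the pump off).
candidate = lambda p, pump, alarm: (not p) or alarm

def counterexample(formalization, prop):
    """Return a concrete assignment the formalization allows but the property
    forbids, or None. An SMT solver would produce this as a model of
    (formalization AND NOT prop)."""
    for v in product([False, True], repeat=len(VARS)):
        if formalization(*v) and not prop(*v):
            return dict(zip(VARS, v))
    return None

cex = counterexample(candidate, safe)
print(cex)  # {'pressure_high': True, 'pump_on': True, 'alarm': True}
```

A granular counterexample like this one tells the LLM exactly which allowed behavior is unsafe, which is the kind of concrete symbolic feedback the paper reports driving verified accuracy from 55.4% to 98.5%.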
