Syntax Is Easy, Semantics Is Hard: Evaluating LLMs for LTL Translation

April 8, 2026
arXiv:2604.07321

Priscilla Kyei Danso, Mohammad Saqib Hasan, Niranjan Balasubramanian, Omar Chowdhury

cs.LO, cs.AI

TLDR

LLMs translating English into LTL handle the syntax better than the semantics; recasting the task as Python code completion substantially improves overall performance.

Key contributions

Evaluated how effectively several representative LLMs translate assertive English sentences into LTL formulas, using human-generated and synthetic ground-truth data.
Found LLMs perform better on LTL syntactic aspects than semantic ones.
Showed that more detailed prompts generally improve LLM performance.
Demonstrated substantial performance improvement by framing the task as Python code-completion.
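
To make the code-completion framing concrete, here is a minimal sketch of what such a prompt could look like. It is not the paper's prompt: the helper names (`Atom`, `G`, `F`, `X`, `U`, and so on), the stub text, and the function `build_completion_prompt` are all hypothetical, chosen only to show the idea of ending the prompt on an unfinished Python assignment that the model is asked to complete.

```python
# Hypothetical helper stubs shown to the model as context (an assumption,
# not taken from the paper). The triple-quoted text is prompt content only.
STUB = '''# LTL helper library (illustrative only).
class Atom: ...
def G(f): ...        # globally / always
def F(f): ...        # eventually
def X(f): ...        # next
def U(f, g): ...     # until
def Implies(f, g): ...
def And(f, g): ...
def Or(f, g): ...
def Not(f): ...
'''


def build_completion_prompt(sentence: str, atoms: dict[str, str]) -> str:
    """Return a prompt that ends mid-statement so the model completes the formula."""
    lines = [STUB, f"# English requirement: {sentence}"]
    for var, meaning in atoms.items():
        # Each atomic proposition becomes a named Python variable.
        lines.append(f'{var} = Atom("{var}")  # {meaning}')
    # The unfinished assignment is the part the model is expected to fill in.
    lines.append("formula = ")
    return "\n".join(lines)


prompt = build_completion_prompt(
    "Every request is eventually granted.",
    {"request": "a request is issued", "grant": "the request is granted"},
)
```

Under this framing, a completion such as `G(Implies(request, F(grant)))` would be the expected continuation, and it can be parsed as ordinary Python rather than as free-form LTL text.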

Why it matters

This paper addresses the challenge of translating natural language into LTL, a critical step for using security and privacy analysis tools. By identifying LLM strengths and weaknesses, especially the benefit of code-completion prompts, it provides practical guidance for improving LTL translation. This work helps make complex formal methods more accessible to developers.

Original Abstract

Propositional Linear Temporal Logic (LTL) is a popular formalism for specifying desirable requirements and security and privacy policies for software, networks, and systems. Yet expressing such requirements and policies in LTL remains challenging because of its intricate semantics. Since many security and privacy analysis tools require LTL formulas as input, this difficulty places them out of reach for many developers and analysts. Large Language Models (LLMs) could broaden access to such tools by translating natural language fragments into LTL formulas. This paper evaluates that premise by assessing how effectively several representative LLMs translate assertive English sentences into LTL formulas. Using both human-generated and synthetic ground-truth data, we evaluate effectiveness along syntactic and semantic dimensions. The results reveal three findings: (1) in line with prior findings, LLMs perform better on syntactic aspects of LTL than on semantic ones; (2) they generally benefit from more detailed prompts; and (3) reformulating the task as a Python code-completion problem substantially improves overall performance. We also discuss challenges in conducting a fair evaluation on this task and conclude with recommendations for future work.
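
The gap between syntactic and semantic correctness can be illustrated with a small sketch. The code below is not the paper's evaluation method. It assumes a simplified finite-trace reading of LTL (standard LTL is interpreted over infinite traces), and the class names, the `holds` function, and the example sentence are all illustrative assumptions. It shows two well-formed formulas built from the same operators and atoms that nevertheless disagree on a concrete trace.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Union

# A trace is a list of states; each state is the set of atoms true in it.
Trace = List[FrozenSet[str]]


@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Implies:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Globally:
    arg: "Formula"


@dataclass(frozen=True)
class Eventually:
    arg: "Formula"


Formula = Union[Atom, Implies, Globally, Eventually]


def holds(f: Formula, trace: Trace, i: int = 0) -> bool:
    """Evaluate f at position i under finite-trace semantics (an assumption)."""
    if isinstance(f, Atom):
        return f.name in trace[i]
    if isinstance(f, Implies):
        return (not holds(f.left, trace, i)) or holds(f.right, trace, i)
    if isinstance(f, Globally):
        return all(holds(f.arg, trace, j) for j in range(i, len(trace)))
    if isinstance(f, Eventually):
        return any(holds(f.arg, trace, j) for j in range(i, len(trace)))
    raise TypeError(f"unsupported formula: {f!r}")


request, grant = Atom("request"), Atom("grant")

# "Every request is eventually granted": G(request -> F grant)
intended = Globally(Implies(request, Eventually(grant)))
# Same operators and atoms, different scoping: (G request) -> (F grant)
lookalike = Implies(Globally(request), Eventually(grant))

# The final request is never granted.
trace: Trace = [frozenset({"request"}), frozenset({"grant"}), frozenset({"request"})]

intended_result = holds(intended, trace)    # False: last request is unanswered
lookalike_result = holds(lookalike, trace)  # True: "G request" fails, so the implication holds
```

Both formulas would pass a purely syntactic check, yet only the first captures the intended requirement, which is the kind of error a semantic evaluation is meant to catch.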
