SecGoal: A Benchmark for Security Goal Extraction and Formalization from Protocol Documents
Dawei Huang, Hui Li, Haonan Feng, Jingjing Guan, Yueshuang Jiao, et al.
TLDR
This paper introduces SecGoal, a benchmark and AI-assisted framework for extracting and formalizing security goals from protocol documents.
Key contributions
- Introduces SecGoal, the first expert-annotated benchmark for security goal extraction from 15 widely deployed protocols.
- Presents AIFG, an AI-assisted framework for context-aware goal extraction and retrieval-augmented formalization.
- Reveals that frontier LLMs achieve low precision (<15%) when identifying security goals, frequently mistaking operational text for goals.
- Demonstrates instruction tuning on SecGoal enables compact models (7B/9B) to achieve >80% F1-scores.
Why it matters
Automating security goal extraction from natural language is a major bottleneck in formal verification, hindering rigorous cryptographic security guarantees. SecGoal provides a crucial dataset and framework to advance this field. It shows that specialized, smaller models can significantly outperform larger general-purpose LLMs for this complex task.
Original Abstract
Formal verification provides rigorous guarantees for cryptographic security, yet automating the extraction and formalization of security goals from natural language protocol documents remains a major bottleneck, compounded by the scarcity of expert-annotated resources and integrated frameworks bridging unstructured text and symbolic logic. We introduce SecGoal, the first expert-annotated benchmark covering 15 widely deployed protocol documents, including 5G-AKA and TLS 1.3, and AIFG, an AI-assisted framework that decomposes the task into context-aware goal extraction and retrieval-augmented formalization. We conduct a comprehensive evaluation to assess whether contemporary LLMs are ready to automate this pipeline. Our results reveal a pronounced precision-recall imbalance: frontier models, such as Gemini 2.5-Pro, achieve high recall but precision below 15%, frequently misclassifying operational text as security goals. In contrast, instruction tuning on SecGoal enables compact models with 7B/9B parameters to achieve F1-scores above 80%, substantially outperforming larger general-purpose models. Our work establishes a foundational dataset and reproducible baseline for automated formal protocol analysis.
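The precision-recall imbalance described above can be made concrete with a small scoring sketch. This is an illustrative simplification, not the paper's evaluation code: the goal strings and the exact-match criterion are hypothetical, and the benchmark's actual matching protocol may differ.

```python
# Illustrative sketch (hypothetical data, not SecGoal's evaluation script):
# scoring extracted security goals against expert annotations.

def score(predicted: set[str], gold: set[str]) -> dict[str, float]:
    tp = len(predicted & gold)  # goals both extracted and annotated
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# A model that over-extracts: it recovers both true goals (perfect recall)
# but also labels many operational steps as goals (low precision).
gold = {"mutual authentication", "session key secrecy"}
predicted = gold | {f"operational step {i}" for i in range(12)}

print(score(predicted, gold))
# → recall 1.0, but precision 2/14 ≈ 0.14, so F1 collapses to 0.25
```

This mirrors the reported failure mode: high recall with precision below 15% yields a weak F1, which is why targeted instruction tuning on precision-critical examples can let compact models overtake larger general-purpose ones.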