ArXiv TLDR

(How) Do Large Language Models Understand High-Level Message Sequence Charts?

arXiv:2605.13773

Mohammad Reza Mousavi

cs.SE cs.AI cs.LO

TLDR

LLMs show only a modest understanding (ca. 52% accuracy) of the formal semantics of High-Level Message Sequence Charts: they handle basic constructs well but struggle with abstraction, composition, and trace-level reasoning.

Key contributions

  • Evaluated three LLMs (Gemini-3, GPT-5.4, Qwen-3.6) on 129 HMSC semantic tasks.
  • Overall LLM accuracy for HMSC formal semantics was modest, around 52%.
  • LLMs handled basic MSC semantic concepts well (ca. 88%) but struggled with abstraction and composition (ca. 36%) and with traces and LTSs (ca. 42%).
  • All three LLMs struggled with co-regions and explicit causal dependencies and never used them in semantic-preserving transformations.

Why it matters

This paper reveals significant limitations in current LLMs' ability to consistently apply formal semantics to architectural design specifications such as HMSCs. It highlights a critical gap that must be closed before LLMs can reliably automate complex software development tasks built on such specifications.

Original Abstract

Large Language Models (LLMs) are being employed widely to automate tasks across the software development life-cycle. It is, however, unclear whether these tasks are performed consistently with respect to the semantics of the artefacts being handled. This question is particularly under-researched concerning architectural design specification. In this paper, we address this question for High-Level Message Sequence Charts (HMSCs). These are visual models with a rigorous formal semantics that have been used for various purposes, including as a foundation for Sequence Diagrams in the Unified Modelling Language (UML). We examine whether LLMs "understand" the semantics of HMSCs by examining three LLMs (Gemini-3, GPT-5.4, and Qwen-3.6) on how they perform 129 semantic tasks ranging from querying basic semantic constructs in HMSCs (i.e., events and their ordering) to semantic-preserving abstractions and compositions, and calculating the set of traces and trace-equivalent labelled transition systems. The results show that LLMs only have a modest understanding of the formal semantics of HMSCs (ca. 52% overall accuracy), with great variability across different semantic concepts: while LLMs seem to understand the basic semantic concepts of MSCs (ca. 88% accuracy), they struggle with semantic reasoning in tasks involving abstraction and composition (ca. 36% accuracy) and traces and LTSs (ca. 42% accuracy). In particular, all three LLMs struggle with the notions of co-region and explicit causal dependencies and never employed them in semantic-preserving transformations.
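
The abstract refers to the core semantic ingredients of MSCs: events, their partial ordering, and the resulting set of traces. As a rough illustration (not taken from the paper, and with a made-up two-instance chart and event names), the Python sketch below models an MSC as a set of events plus a precedence relation and computes its trace set as the linearizations of that partial order.

```python
from itertools import permutations

# Toy MSC (illustrative only): instance A sends messages m1 and m2 to instance B.
# Event naming convention here: "A!m1" = A sends m1, "B?m1" = B receives m1.
events = ["A!m1", "A!m2", "B?m1", "B?m2"]

# MSC semantics orders events by (1) send-before-receive for each message and
# (2) top-to-bottom order along each instance's lifeline.
precedes = {
    ("A!m1", "B?m1"),  # message m1: send before receive
    ("A!m2", "B?m2"),  # message m2: send before receive
    ("A!m1", "A!m2"),  # instance A: vertical order on its lifeline
    ("B?m1", "B?m2"),  # instance B: vertical order on its lifeline
}

def is_trace(order):
    """A permutation of the events is a trace iff it respects every precedence pair."""
    pos = {e: i for i, e in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in precedes)

# The trace set is the set of all linearizations of the partial order.
traces = [order for order in permutations(events) if is_trace(order)]
for t in traces:
    print(" -> ".join(t))
# Two traces survive: B may receive m1 before or after A sends m2.
```

The tasks in the paper go well beyond this sketch (co-regions, explicit causal dependencies, semantic-preserving abstraction and composition, and trace-equivalent LTSs), but the same event-and-partial-order view underlies them.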
