ArXiv TLDR

Characterising LLM-Generated Competency Questions: a Cross-Domain Empirical Study using Open and Closed Models

arXiv: 2604.16258

Reham Alharbi, Valentina Tamma, Terry R. Payne, Jacopo de Berardinis

cs.AI

TLDR

This paper empirically characterizes LLM-generated Competency Questions using quantitative measures across diverse models and use cases.

Key contributions

  • Introduces quantitative measures to systematically compare LLM-generated Competency Questions (CQs); a sketch of such measures follows this list.
  • Identifies key properties of LLM-generated CQs: readability, relevance to the input text, and structural complexity.
  • Conducts a cross-domain study using both open and closed LLMs (e.g., GPT-4.1, Llama3.1-8B).
  • Reveals that LLMs exhibit distinct CQ generation profiles shaped by the specific use case and requirements.
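The summary does not reproduce the paper's exact formulas, but the three properties it names map onto well-known proxies. Below is a minimal sketch, assuming Flesch reading ease for readability, TF-IDF cosine similarity for relevance to the input text, and a token/connective count for structural complexity; all three metric choices are our assumptions, not necessarily the paper's measures.

```python
# Hypothetical proxies for the three CQ properties named above
# (not the paper's code). Requires: pip install textstat scikit-learn
import textstat
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def readability(cq: str) -> float:
    """Flesch reading ease: higher scores mean easier-to-read questions."""
    return textstat.flesch_reading_ease(cq)


def relevance(cq: str, use_case: str) -> float:
    """Lexical relevance of a CQ to the input use-case text (TF-IDF cosine)."""
    tfidf = TfidfVectorizer().fit_transform([cq, use_case])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])


def structural_complexity(cq: str) -> int:
    """Crude structural proxy: token count plus clause-like connectives."""
    tokens = cq.lower().split()
    connectives = {"and", "or", "which", "that", "when", "where", "if"}
    return len(tokens) + sum(t in connectives for t in tokens)


use_case = "An ontology describing musical works, their composers and performances."
cq = "Which composers wrote works that were performed at a given venue?"
print(readability(cq), relevance(cq, use_case), structural_complexity(cq))
```

Scoring each generated question this way yields a small vector per CQ, which is the kind of representation that lets per-model generation profiles be compared across use cases.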

Why it matters

LLMs can automate the creation of Competency Questions, making ontology engineering more accessible. This paper provides a crucial empirical characterization of these LLM-generated CQs. It helps practitioners understand the strengths and weaknesses of different LLMs for this task, guiding model selection and improving requirement elicitation.

Original Abstract

Competency Questions (CQs) are a cornerstone of requirement elicitation in ontology engineering. CQs represent requirements as a set of natural language questions that an ontology should satisfy; they are traditionally modelled by ontology engineers together with domain experts as part of a human-centred, manual elicitation process. The use of Generative AI automates CQ creation at scale, therefore democratising the process of generation, widening stakeholder engagement, and ultimately broadening access to ontology engineering. However, given the large and heterogeneous landscape of LLMs, varying in dimensions such as parameter scale, task and domain specialisation, and accessibility, it is crucial to characterise and understand the intrinsic, observable properties of the CQs they produce (e.g., readability, structural complexity) through a systematic, cross-domain analysis. This paper introduces a set of quantitative measures for the systematic comparison of CQs across multiple dimensions. Using CQs generated from well-defined use cases and scenarios, we identify their salient properties, including readability, relevance with respect to the input text and structural complexity of the generated questions. We conduct our experiments over a set of use cases and requirements using a range of LLMs, including both open (KimiK2-1T, Llama3.1-8B, Llama3.2-3B) and closed models (Gemini 2.5 Pro, GPT-4.1). Our analysis demonstrates that LLM performance reflects distinct generation profiles shaped by the use case.
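To make the setup concrete, here is a minimal sketch of how CQs might be elicited from a use-case description with one of the closed models the paper tests (GPT-4.1, called via the OpenAI Python SDK). The prompt wording and the example use case are illustrative assumptions, not the paper's actual protocol; the open models could be served locally and swapped in behind the same interface.

```python
# Illustrative CQ elicitation from a use-case description (prompt wording is
# our assumption, not the paper's). Requires: pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

use_case = (
    "We are building an ontology for a university library: books, authors, "
    "loans, and the members who borrow them."
)

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system",
         "content": "You are an ontology engineer eliciting requirements."},
        {"role": "user",
         "content": "Given this use case, list 5 competency questions the "
                    f"ontology should answer:\n\n{use_case}"},
    ],
)
print(response.choices[0].message.content)
```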
