ArXiv TLDR

Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study

arXiv:2605.07422

Moaath Alshaikh, Tasneem Alshaher, Ricardo Vieira, Beatriz Santana, Clelio Xavier + 5 more

cs.SE, cs.AI

TLDR

This study evaluates prompt engineering strategies for LLM-based qualitative coding of psychological safety, finding that multi-shot prompting significantly improves agreement for Claude Haiku but not for DeepSeek-Chat or Gemini 2.5 Flash.

Key contributions

  • Evaluated Claude Haiku, DeepSeek-Chat, and Gemini 2.5 Flash under zero-shot and multi-shot closed-coding prompts, with ten independent runs per configuration.
  • Multi-shot prompting significantly improved agreement for Claude Haiku (Δκ = +0.034, Wilcoxon p = 0.004), but not for DeepSeek-Chat or Gemini 2.5 Flash (see the analysis sketch after this list).
  • DeepSeek-Chat and Claude Haiku showed the highest intra-model stability (SD ≈ 0.017); Gemini 2.5 Flash was the least stable (SD = 0.038).
  • All models consistently over-predicted "Sharing Negative Feedback" (bias ratios up to 5.25×) and under-predicted "Expressing Concerns".
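
To make the evaluation setup concrete, here is a minimal sketch of the agreement analysis: ten independent runs per prompting strategy, Cohen's kappa against a human reference coding for each run, and a paired Wilcoxon signed-rank test between strategies. The data are synthetic and the third label is hypothetical (only "Sharing Negative Feedback" and "Expressing Concerns" are named in the paper); this is an illustration, not the authors' code.

```python
# Illustrative re-creation of the paper's agreement analysis (not the authors' code).
# Synthetic codings stand in for the real human and model data.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score

# First two labels come from the paper; the third is hypothetical.
LABELS = ["Sharing Negative Feedback", "Expressing Concerns", "Asking Questions"]
rng = np.random.default_rng(0)
human = rng.choice(LABELS, size=200)  # human reference coding of 200 excerpts

def simulate_runs(agreement, n_runs=10):
    """Fake model codings that match the human code with probability `agreement`."""
    runs = []
    for _ in range(n_runs):
        flip = rng.random(human.size) > agreement
        runs.append(np.where(flip, rng.choice(LABELS, size=human.size), human))
    return runs

# One Cohen's kappa per run, for each prompting strategy.
k_zero  = [cohen_kappa_score(human, r) for r in simulate_runs(agreement=0.70)]
k_multi = [cohen_kappa_score(human, r) for r in simulate_runs(agreement=0.75)]

print(f"zero-shot:  mean kappa = {np.mean(k_zero):.3f}, SD = {np.std(k_zero, ddof=1):.3f}")
print(f"multi-shot: mean kappa = {np.mean(k_multi):.3f}, SD = {np.std(k_multi, ddof=1):.3f}")

# Paired Wilcoxon signed-rank test over the ten runs, as in the paper.
_, p = wilcoxon(k_multi, k_zero)
print(f"Delta kappa = {np.mean(k_multi) - np.mean(k_zero):+.3f}, Wilcoxon p = {p:.3f}")
```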

Why it matters

Qualitative coding is a demanding, subjective process, and this paper provides empirical evidence on when LLMs can reproduce it reliably. Its prompt engineering guidelines help researchers leverage LLMs for analyzing the human and social aspects of software engineering, advancing the practical use of LLMs in research.

Original Abstract

Qualitative analysis plays a pivotal role in understanding the human and social aspects of software engineering. However, it remains a demanding process shaped by the subjective interpretation of individual researchers and sensitive to methodological choices such as prompt design. Recent advancements in Large Language Models (LLMs) offer promising opportunities to support this type of analysis, although their reliability in reproducing human qualitative reasoning under varying prompting conditions remains largely untested. This study presents a controlled empirical evaluation of three LLMs -- Claude Haiku, DeepSeek-Chat, and Gemini 2.5 Flash -- across two prompt engineering strategies (zero-shot and multi-shot closed coding), using Cohen's kappa as the primary agreement metric over ten independent runs per configuration. Results suggest that multi-shot prompting significantly improves agreement for Claude Haiku (Δκ = +0.034, Wilcoxon p = 0.004) but not for DeepSeek-Chat or Gemini 2.5 Flash. Intra-model stability varies substantially -- DeepSeek-Chat and Claude Haiku exhibit the lowest variance (SD ≈ 0.017), while Gemini 2.5 Flash is the least stable (SD = 0.038). A systematic over-prediction of "Sharing Negative Feedback" is identified across all models (bias ratios up to 5.25×), alongside consistent under-prediction of "Expressing Concerns." Collectively, these findings provide empirical evidence for prompt engineering guidelines in LLM-assisted qualitative coding for software engineering research.
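
The bias finding can be probed with a per-label ratio of model predictions to human codings. The sketch below uses one plausible definition, the model's count for a label divided by the human coder's count, on toy data; the paper's exact formula is not given here and may differ.

```python
# Sketch of a per-label bias ratio: count(model assigns label) / count(human assigns label).
# A ratio well above 1 flags over-prediction of that code, as reported for
# "Sharing Negative Feedback"; well below 1 flags under-prediction.
from collections import Counter

def bias_ratios(human_codes, model_codes):
    human_counts = Counter(human_codes)
    model_counts = Counter(model_codes)
    return {label: model_counts[label] / human_counts[label]
            for label in human_counts}

# Toy example: the model assigns "Sharing Negative Feedback" far more often
# than the human coder did, and "Expressing Concerns" far less often.
human = ["Sharing Negative Feedback"] * 4 + ["Expressing Concerns"] * 8
model = ["Sharing Negative Feedback"] * 9 + ["Expressing Concerns"] * 3
for label, ratio in bias_ratios(human, model).items():
    print(f"{label}: {ratio:.2f}x")
```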
