ArXiv TLDR

Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

arXiv:2604.19598

Kihyuk Lee

cs.CL, cs.AI

TLDR

A study reveals that GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash exhibit markedly different consistency and generative behaviors when producing exercise prescriptions.

Key contributions

  • Compared GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash for exercise prescription consistency.
  • GPT-4.1 achieved high semantic similarity (0.955) with 100% unique outputs; Gemini 2.5 Flash scored similarly (0.950) largely through verbatim repetition (27.5% unique outputs).
  • Identical decoding settings produced fundamentally different consistency profiles across LLMs.
  • Safety expression was uniformly high, making it an ineffective metric for model differentiation.
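The key distinction above — high similarity from stable paraphrasing versus high similarity from verbatim repetition — comes down to two metrics computed over a batch of repeated generations. The sketch below is illustrative only: the paper uses embedding-based semantic similarity, whereas here `difflib.SequenceMatcher` stands in as a surface-level proxy, and the function names and sample prescriptions are our own.

```python
from difflib import SequenceMatcher
from itertools import combinations

def uniqueness_ratio(outputs):
    """Fraction of exactly distinct outputs (after whitespace normalization)."""
    normalized = {" ".join(o.split()) for o in outputs}
    return len(normalized) / len(outputs)

def mean_pairwise_similarity(outputs):
    """Mean pairwise similarity across all output pairs.
    SequenceMatcher is a stand-in for the embedding-based
    semantic similarity used in the paper."""
    pairs = list(combinations(outputs, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

# A Gemini-like profile: high similarity driven by near-verbatim repetition.
repeated = ["30 min brisk walking, 5 days/week"] * 3 + ["30 min brisk walking, 4 days/week"]

# A GPT-like profile: every output unique, but semantically close.
unique = [
    "Walk briskly 30 minutes, five days weekly.",
    "Brisk walking for 30 min on 5 days each week.",
    "Five weekly sessions of 30-minute brisk walking.",
    "30-minute brisk walks, five times per week.",
]

print(uniqueness_ratio(repeated), uniqueness_ratio(unique))  # 0.5 1.0
```

Both batches score high on mean pairwise similarity, but only the uniqueness ratio reveals that one model is repeating itself rather than reasoning consistently — which is the paper's central point.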

Why it matters

This research shows that LLMs differ substantially in how consistently they generate exercise prescriptions, even under identical decoding settings. It argues that model selection for clinical tools must evaluate repeated-generation behavior, not just single outputs, to ensure reliable deployment.

Original Abstract

This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.
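The abstract confirms inter-model differences with a Kruskal-Wallis test (H = 458.41, p < .001), the standard non-parametric alternative to one-way ANOVA for comparing three independent groups of similarity scores. As a reference for what that statistic computes, here is a minimal stdlib sketch of the tie-corrected H statistic; in practice one would use `scipy.stats.kruskal`, which also returns a p-value. The function name and toy data are our own.

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (with tie correction) over k independent samples."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank_of = {}       # value -> average rank (ties share the mean of their ranks)
    tie_term = 0.0     # sum of t**3 - t over tie groups, for the correction factor
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        tie_term += t**3 - t
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n**3 - n)
    return h / correction if correction else h

# Three fully separated groups give the textbook value H = 7.2.
print(kruskal_h([1, 2, 3], [4, 5, 6], [7, 8, 9]))
```

A large H (such as the 458.41 reported here, over 360 outputs) indicates that at least one model's similarity distribution differs markedly from the others.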
