ArXiv TLDR

Is Vibe Coding the Future? An Empirical Assessment of LLM Generated Codes for Construction Safety

arXiv: 2604.12311

S M Jamil Uddin

cs.SE · cs.AI · cs.HC

TLDR

An empirical study finds that ~45% of LLM-generated construction-safety code fails silently (it runs without error but encodes flawed safety logic), making it unreliable without deterministic wrappers.

Key contributions

  • Empirically evaluated 450 LLM-generated Python scripts for construction safety across three frontier models.
  • Identified an alarming ~45% overall "Silent Failure Rate": code that compiles and runs but executes flawed safety logic.
  • GPT-4o-Mini generated mathematically inaccurate outputs in ~56% of its functional code.
  • Less formal prompts significantly increase the AI's propensity to hallucinate missing safety variables.
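To make the "silent failure" concept concrete, here is a hypothetical sketch; the fall-clearance scenario, function name, and the omitted 2 ft safety margin are illustrative assumptions, not examples drawn from the paper's dataset:

```python
def required_fall_clearance(lanyard_len_ft: float,
                            deceleration_ft: float,
                            worker_height_ft: float) -> float:
    """Clearance needed below an anchor point, in feet.

    Silent failure: this function compiles, runs, and returns a
    plausible number, but it omits the customary safety margin
    (commonly ~2 ft) from the total, so it under-reports the
    clearance a worker actually needs.
    """
    # Flawed logic: no exception, no warning, just an unsafe result.
    return lanyard_len_ft + deceleration_ft + worker_height_ft
```

A script built on this function would pass any execution-only check, which is exactly why the study argues that syntactic reliability masks logic deficits.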

Why it matters

This paper critically assesses LLM-generated code for construction safety, revealing a high "silent failure" rate where code appears functional but contains severe logical flaws. It underscores the urgent need for deterministic AI wrappers and strict governance before deploying LLMs in safety-critical cyber-physical systems.
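As a minimal sketch of what a deterministic wrapper could look like, the gate below only releases an LLM-generated function if it agrees with a hand-verified reference on every test case. The function names and the test-oracle approach are assumptions for illustration, not the paper's proposed architecture:

```python
def deterministic_wrapper(candidate_fn, reference_fn, cases, tol=1e-6):
    """Gate untrusted LLM-generated numeric code behind a trusted oracle.

    candidate_fn:  the LLM-generated function (untrusted).
    reference_fn:  a vetted reference implementation of the same formula.
    cases:         iterable of argument tuples to check.

    Raises ValueError on any disagreement; returns candidate_fn only
    if every check passes, so flawed logic cannot fail silently.
    """
    for args in cases:
        got = candidate_fn(*args)
        want = reference_fn(*args)
        if abs(got - want) > tol:
            raise ValueError(
                f"silent failure on {args}: got {got}, expected {want}")
    return candidate_fn
```

In practice the reference would encode a vetted domain formula (e.g., a published fall-protection calculation); the point is that the probabilistic generator never reaches a safety-critical deployment without passing a deterministic check.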

Original Abstract

The emergence of vibe coding, a paradigm where non-technical users instruct Large Language Models (LLMs) to generate executable codes via natural language, presents both significant opportunities and severe risks for the construction industry. While empowering construction personnel such as the safety managers, foremen, and workers to develop tools and software, the probabilistic nature of LLMs introduces the threat of silent failures, wherein generated code compiles perfectly but executes flawed mathematical safety logic. This study empirically evaluates the reliability, software architecture, and domain-specific safety fidelity of 450 vibe-coded Python scripts generated by three frontier models, Claude 3.5 Haiku, GPT-4o-Mini, and Gemini 2.5 Flash. Utilizing a persona-driven prompt dataset (n=150) and a bifurcated evaluation pipeline comprising isolated dynamic sandboxing and an LLM-as-a-Judge, the research quantifies the severe limits of zero-shot vibe codes for construction safety. The findings reveal a highly significant relationship between user persona and data hallucination, demonstrating that less formal prompts drastically increase the AI's propensity to invent missing safety variables. Furthermore, while the models demonstrated high foundational execution viability (~85%), this syntactic reliability actively masked logic deficits and a severe lack of defensive programming. Among successfully executed scripts, the study identified an alarming ~45% overall Silent Failure Rate, with GPT-4o-Mini generating mathematically inaccurate outputs in ~56% of its functional code. The results demonstrate that current LLMs lack the deterministic rigor required for standalone safety engineering, necessitating the adoption of deterministic AI wrappers and strict governance for cyber-physical deployments.
