The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code
Hengzhi Ye, Fengyuan Ran, Weiwei Xu, Minghui Zhou
TLDR
LLMs generate code whose readability is comparable to that of human-written code but exhibits distinct issue patterns, and prompt design has only limited impact.
Key contributions
- Developed a comprehensive readability model combining textual, structural, program, and visual code features.
- Evaluated LLM-generated code readability across 5,869 scenarios from WoC and LeetCode.
- Found LLM code readability is comparable to human code but shows distinct issue patterns.
- Identified function signatures, constraints, and style descriptions as most influential prompt factors.
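To make the first contribution concrete, here is a toy sketch of what a feature-based readability model might compute. This is purely illustrative and is not the paper's actual model: the function names, the specific features (one per category: textual, structural, program, visual), and the thresholds are all assumptions for demonstration.

```python
import re
import statistics

def readability_features(code: str) -> dict:
    """Hypothetical extractor computing one toy feature per category
    (textual, structural, program, visual). Not the paper's model."""
    lines = code.splitlines()
    nonblank = [l for l in lines if l.strip()]
    # Textual feature: average identifier length.
    identifiers = re.findall(r"\b[a-zA-Z_][a-zA-Z0-9_]*\b", code)
    avg_ident = statistics.mean(map(len, identifiers)) if identifiers else 0.0
    # Structural feature: maximum indentation depth, in 4-space units.
    max_indent = max(
        ((len(l) - len(l.lstrip())) // 4 for l in nonblank), default=0
    )
    # Program feature: fraction of non-blank lines that are comments.
    comments = sum(1 for l in nonblank if l.lstrip().startswith("#"))
    comment_density = comments / len(nonblank) if nonblank else 0.0
    # Visual feature: average non-blank line length.
    avg_line = statistics.mean(len(l) for l in nonblank) if nonblank else 0.0
    return {
        "avg_identifier_len": avg_ident,
        "max_indent_depth": max_indent,
        "comment_density": comment_density,
        "avg_line_len": avg_line,
    }

snippet = "def add(a, b):\n    # sum two numbers\n    return a + b\n"
print(readability_features(snippet))
```

A real model would combine many such features per category into a single score; this sketch only shows the feature-extraction step that makes readability comparisons between LLM-generated and human-written code quantifiable.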
Why it matters
This paper validates, from a non-functional perspective, the potential of LLM-generated code for integration into software workflows. It also highlights latent technical debt arising from distinct readability issue patterns and the limited effectiveness of prompt engineering, pointing future research toward better long-term maintainability.
Original Abstract
As Large Language Models (LLMs) are transforming software development, the functional quality of generated code has become a central focus, leaving readability, one of the critical non-functional attributes, understudied. Given that LLM-generated code still needs human review before adoption, it is important to understand its readability, especially compared with human-written code, and the role of prompt design in shaping it. We therefore set out to conduct a systematic investigation into the readability of LLM-generated code. To systematically quantify code readability, we establish a comprehensive readability model that synthesizes textual, structural, program, and visual features of code. Based on the model, we evaluate the readability of code generated by mainstream LLMs under 5,869 scenarios extracted from large code bases, including World of Code (WoC) and LeetCode. We find that current LLMs produce code with overall readability comparable to human-written code, but displaying distinct readability issue patterns. We further examine how different prompt dimensions affect the readability of LLM-generated code, and find that function signatures, constraints, and style descriptions emerge as the most influential factors, while the overall impact of prompt design remains limited. Our findings indicate that, on one hand, LLM-generated code is at least comparable to human-written code in readability, validating its potential for systematic integration into software workflows from a non-functional perspective; on the other hand, distinct readability issue patterns and the limited effectiveness of prompt engineering reveal a latent technical debt, highlighting the need for future research to improve the readability of LLM-generated code and thus ensure long-term maintainability.