LLMs Are Not a Silver Bullet: A Case Study on Software Fairness
Xinyue Li, Sixuan Li, Ying Xiao, Jie M. Zhang, Zhou Yang + 2 more
TLDR
A large-scale study finds that traditional ML methods consistently outperform LLM-based methods in both software fairness and predictive performance, challenging the utility of LLMs in this domain.
Key contributions
- ML-based methods consistently outperform LLM-based methods in both fairness and predictive performance.
- Prior LLM gains were attributed to artificially balanced test data, not realistic imbalanced distributions.
- Existing LLM methods relying on in-context learning fail to leverage full training data effectively.
- Supervised fine-tuning for LLMs achieves competitive results but offers limited advantages over traditional ML.
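The second contribution, that prior gains stemmed from artificially balanced test data, can be illustrated with a minimal sketch. The snippet below (entirely hypothetical, not from the paper; it uses synthetic data and scikit-learn's `LogisticRegression` as a stand-in for any tabular classifier) evaluates the same model on a test set with its natural class imbalance and on a version where the majority class has been downsampled to parity. The two numbers generally differ, showing why conclusions drawn on artificially balanced test sets may not transfer to realistic distributions.

```python
# Hypothetical sketch: evaluating one classifier on a realistic (imbalanced)
# test set versus an artificially balanced one. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic imbalanced tabular data: roughly 10% positive labels.
n = 5000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 1.28).astype(int)

X_train, X_test = X[:4000], X[4000:]
y_train, y_test = y[:4000], y[4000:]

clf = LogisticRegression().fit(X_train, y_train)

# Realistic evaluation: keep the test set's natural class imbalance.
acc_imbalanced = accuracy_score(y_test, clf.predict(X_test))

# Artificially balanced evaluation: downsample the majority class so
# positives and negatives are equally represented in the test set.
pos = np.flatnonzero(y_test == 1)
neg = rng.choice(np.flatnonzero(y_test == 0), size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
acc_balanced = accuracy_score(y_test[idx], clf.predict(X_test[idx]))

print(f"imbalanced test accuracy: {acc_imbalanced:.3f}")
print(f"balanced test accuracy:   {acc_balanced:.3f}")
```

Which evaluation looks more favorable depends on the model's error profile, which is precisely why the paper argues that balanced test sets can make one family of methods appear stronger than it is under the deployment distribution.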
Why it matters
This paper provides practical guidance for software engineers choosing bias mitigation methods. It shows that LLMs are not a superior solution for software fairness compared to traditional ML, especially under realistic, imbalanced data distributions. This helps prevent misallocation of resources and sets realistic expectations for applying LLMs in high-stakes systems.
Original Abstract
Fairness is a critical requirement for human-related, high-stakes software systems, motivating extensive research on bias mitigation. Prior work has largely focused on tabular data settings using traditional Machine Learning (ML) methods. With the rapid rise of Large Language Models (LLMs), recent studies have begun to explore their use for bias mitigation in the same setting. However, it remains unclear whether LLM-based methods offer advantages over traditional ML methods, leaving software engineers without clear guidance for practical adoption. To address this gap, we present a large-scale study comparing state-of-the-art ML- and LLM-based bias mitigation methods. We find that ML-based methods consistently outperform LLM-based methods in both fairness and predictive performance, with even strong LLMs failing to surpass established ML baselines. To understand why prior LLM-based studies report favorable results, we analyze their evaluation settings and show that these gains are largely driven by artificially balanced test data rather than realistic imbalanced distributions. We further observe that existing LLM-based methods primarily rely on in-context learning and thus fail to leverage all available training data. Motivated by this, we explore supervised fine-tuning on the full training set and find that, while it achieves competitive results, its advantages over traditional ML methods remain limited. These findings suggest that LLMs are not a silver bullet for software fairness.