From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation
Minh Duc Bui, Xenia Heilmann, Mattia Cerrato, Manuel Mager, Katharina von der Wense
TLDR
LLMs generating ML pipelines include sensitive attributes in feature selection far more often than simple if-statement tests reveal, so prior bias evaluations underestimate real-world risk.
Key contributions
- Prior bias evaluations that rely on simple if-statements significantly underestimate real-world bias in code generation.
- LLMs generating ML pipelines include sensitive attributes in 87.7% of cases on average, versus 59.2% when generating simple if-statements.
- This bias is robust and targeted: models include sensitive attributes while correctly excluding genuinely irrelevant features, and it persists across prompt mitigation strategies, attribute counts, and pipeline difficulty levels (see the sketch below).
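To make the contrast concrete, here is a minimal sketch of the two settings. The credit-scoring framing and the column names ("race", "favorite color") come from the abstract's example; the function names and surrounding code are illustrative, not the paper's actual prompts or model outputs.

```python
import pandas as pd

# Setting 1: overt bias in a simple conditional, the kind of completion
# prior benchmarks probe. The bias is explicit and easy to flag.
def credit_decision(applicant: dict) -> str:
    if applicant["race"] == "white":  # explicitly encoded bias
        return "approve"
    return "review"

# Setting 2: subtler bias in a generated pipeline's feature-selection step,
# the setting this paper examines. The irrelevant column is dropped, but
# the sensitive attribute survives into the training features.
def select_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop(columns=["favorite_color"])     # irrelevant: correctly removed
    return df[["income", "debt_ratio", "race"]]  # sensitive: still included
```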
Why it matters
This paper reveals that current methods for evaluating bias in code generation are insufficient, dramatically underestimating real-world risks. It highlights the urgent need for more realistic benchmarks to prevent biased ML pipelines from being deployed.
Original Abstract
Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bias. We demonstrate that this approach dramatically underestimates bias in practice by examining a more realistic task: generating machine learning (ML) pipelines. Testing both code-specialized and general-instruction large language models, we find that generated pipelines exhibit significant bias during feature selection. Sensitive attributes appear in 87.7% of cases on average, despite models demonstrably excluding irrelevant features (e.g., including "race" while dropping "favorite color" for credit scoring). This bias is substantially more prevalent than that captured by conditional statements, where sensitive attributes appear in only 59.2% of cases. These findings are robust across prompt mitigation strategies, varying numbers of attributes, and different pipeline difficulty levels. Our results challenge simple conditionals as valid proxies for bias evaluation and suggest current benchmarks underestimate bias risk in practical deployments.
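The headline numbers (87.7% vs. 59.2%) are inclusion rates measured over many generations. Below is a minimal sketch of how such a rate could be computed; the naive substring check and the sensitive-attribute list are assumptions for illustration, not the authors' actual evaluation harness.

```python
# Hypothetical measurement sketch: estimate how often generated code retains
# a sensitive attribute. The substring check is naive (a real harness would
# parse the code), and the attribute list is assumed, not the paper's.
SENSITIVE_ATTRIBUTES = {"race", "gender", "religion"}

def includes_sensitive_attribute(generated_code: str) -> bool:
    return any(attr in generated_code for attr in SENSITIVE_ATTRIBUTES)

def inclusion_rate(generations: list[str]) -> float:
    hits = sum(includes_sensitive_attribute(code) for code in generations)
    return hits / len(generations)

# Example with hypothetical generations: two of three keep a sensitive column.
samples = [
    'features = ["income", "race"]',
    'features = ["income", "debt_ratio"]',
    'X = df[["gender", "income"]]',
]
print(inclusion_rate(samples))  # 0.666...
```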