ArXiv TLDR

Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems

2605.12406

William Parris

cs.AI

TLDR

Semantic Reward Collapse (SRC) explains why AI systems suppress visible uncertainty; Constitutional Reward Stratification (CRS) is proposed to preserve epistemic integrity.

Key contributions

  • Identifies Semantic Reward Collapse (SRC), in which semantically distinct evaluative signals are compressed into a single scalar reward.
  • Argues SRC causes AI systems to suppress visible epistemic uncertainty rather than preserve calibrated integrity.
  • Proposes treating uncertainty disclosure and escalation as protected epistemic conduct rather than globally penalized task incompletion.
  • Introduces Constitutional Reward Stratification (CRS), a domain-aware framework for differentiated epistemic attribution.
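The collapse the first contribution describes can be made concrete with a toy sketch (not from the paper; all category names and weights are hypothetical). When per-category dissatisfaction is scalarized, the optimizer cannot distinguish an honest hedge from a confident error:

```python
# Toy illustration of "Semantic Reward Collapse" as scalarization.
# Hypothetical weights: each evaluative category maps to one scalar penalty.
WEIGHTS = {
    "factual_error": -1.0,
    "uncertainty_disclosure": -0.3,  # visible hedging is mildly penalized too
    "formatting_issue": -0.2,
    "latency": -0.1,
}

def scalar_reward(signals: dict) -> float:
    """Collapse per-category dissatisfaction into a single scalar."""
    return sum(WEIGHTS[k] * v for k, v in signals.items())

# Two semantically different responses: one discloses uncertainty honestly
# (and is slower), the other is confidently and partially wrong.
honest = {"factual_error": 0.0, "uncertainty_disclosure": 1.0,
          "formatting_issue": 0.0, "latency": 1.0}
confident_wrong = {"factual_error": 0.4, "uncertainty_disclosure": 0.0,
                   "formatting_issue": 0.0, "latency": 0.0}

# Both collapse to the same scalar, so optimization pressure can favor
# suppressing visible uncertainty over preserving calibration.
print(round(scalar_reward(honest), 6))           # -0.4
print(round(scalar_reward(confident_wrong), 6))  # -0.4
```

The point is only that the shared reward topology erases the epistemic class of the failure, which is the entanglement SRC names.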

Why it matters

The paper attributes recurring behaviors such as suppressed uncertainty and performative certainty to Semantic Reward Collapse (SRC), and proposes Constitutional Reward Stratification (CRS) to preserve epistemic integrity, pointing toward more transparent and reliable AI systems.

Original Abstract

Recent advances in reinforcement learning from human feedback (RLHF) and preference optimization have substantially improved the usability, coherence, and safety of large language models. However, recurring behaviors such as performative certainty, hallucinated continuity, calibration drift, sycophancy, and suppression of visible uncertainty suggest unresolved structural issues within scalarized preference optimization systems. We propose Semantic Reward Collapse (SRC): the compression of semantically distinct forms of evaluative dissatisfaction into generalized optimization signals. Under SRC, categories such as factual incorrectness, uncertainty disclosure, formatting dissatisfaction, latency, and social preference may become entangled within a shared reward topology despite representing fundamentally different epistemic classes. We argue that adaptive reasoning systems operating under generalized evaluative pressure may drift toward suppression of visible epistemic failure rather than preservation of calibrated uncertainty integrity. These behaviors are framed strictly as optimization consequences rather than evidence of deception or anthropomorphic agency. Drawing on institutional proxy collapse, metric gaming, software reliability engineering, and human learning theory, we propose that uncertainty disclosure and escalation behavior should be treated as protected epistemic conduct rather than globally penalized task incompletion. Finally, we introduce Constitutional Reward Stratification (CRS), a domain-aware reward framework intended to preserve differentiated epistemic attribution within adaptive learning systems. We present CRS not as a validated solution, but as a testable governance-oriented research direction requiring further empirical investigation.
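The abstract presents CRS only as a research direction, but its core idea of differentiated epistemic attribution can be sketched: keep reward strata separate, weight them per domain, and constrain the uncertainty-disclosure channel so it is never globally penalized. Everything below (the strata, domain names, and weights) is a hypothetical reading, not the paper's specification:

```python
# Toy sketch of "Constitutional Reward Stratification": rewards stay
# stratified by epistemic category instead of collapsing into one scalar,
# and uncertainty disclosure is a protected channel.
from dataclasses import dataclass

@dataclass(frozen=True)
class StratifiedReward:
    factual: float     # correctness of claims
    disclosure: float  # calibrated uncertainty / escalation (protected)
    style: float       # formatting, tone, latency, etc.

# Hypothetical per-domain constitutions: how each stratum may be weighted.
DOMAIN_POLICY = {
    "medical":  {"factual": 2.0, "disclosure": 0.5, "style": 0.1},
    "chitchat": {"factual": 1.0, "disclosure": 0.0, "style": 0.5},
}

def combine(r: StratifiedReward, domain: str) -> float:
    """Combine strata under a domain policy, enforcing the protection rule."""
    w = DOMAIN_POLICY[domain]
    # The constitutional constraint: disclosure may be ignored in low-stakes
    # domains, but no domain is allowed to penalize it.
    assert w["disclosure"] >= 0, "disclosure is protected epistemic conduct"
    return (w["factual"] * r.factual
            + w["disclosure"] * r.disclosure
            + w["style"] * r.style)

r = StratifiedReward(factual=0.8, disclosure=1.0, style=0.2)
print(round(combine(r, "medical"), 6))  # 2.12
```

Because attribution stays per-stratum until the final, auditable combination step, an evaluator can see that a low score came from a factual error rather than from honest hedging, which is exactly the differentiation SRC destroys.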
