Three Models of RLHF Annotation: Extension, Evidence, and Authority
TLDR
This paper introduces three conceptual models (extension, evidence, authority) for the normative role of human judgments in RLHF, detailing their implications for annotation pipelines.
Key contributions
- Distinguishes three models of the annotator's role in RLHF: extension (annotators extend the system designers' own judgments), evidence (annotators provide independent evidence about moral, social, or other facts), and authority (annotators, as representatives of the broader population, have independent authority to determine outputs).
- Shows how the choice of model shapes how an RLHF pipeline should solicit, validate, and aggregate annotations.
- Surveys landmark papers on RLHF and related methods to show how they implicitly rely on these models, and describes failure modes that arise from conflating them.
- Recommends decomposing annotation into separable dimensions and tailoring each pipeline to the model most appropriate for that dimension (see the sketch after this list).
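The last bullet is the most actionable recommendation, so here is a minimal, hypothetical Python sketch of what per-dimension tailoring could look like: each annotation dimension is routed to an aggregation rule matching the model assumed to govern it. The dimension names, labels, weights, and aggregation rules are illustrative assumptions, not prescriptions taken from the paper.

```python
from collections import Counter

# Hypothetical per-dimension pipeline: route each annotation dimension to the
# aggregation rule matching the model (extension, evidence, authority) assumed
# to govern it. Names, labels, and weights below are illustrative only.

def aggregate_extension(labels, designer_gold=None):
    """Extension: annotators stand in for the designers' own judgment, so a
    designer-provided adjudication overrides annotator disagreement."""
    if designer_gold is not None:
        return designer_gold
    return Counter(labels).most_common(1)[0][0]  # fallback: simple majority

def aggregate_evidence(labels):
    """Evidence: each annotator is a noisy measurement of some fact, so pool
    the measurements (here, a simple majority vote)."""
    return Counter(labels).most_common(1)[0][0]

def aggregate_authority(labels, weights):
    """Authority: annotators represent a broader population, so weight their
    votes to match that population (e.g., demographic reweighting)."""
    tally = Counter()
    for label, weight in zip(labels, weights):
        tally[label] += weight
    return tally.most_common(1)[0][0]

# One pairwise comparison, decomposed into separable dimensions.
annotations = {
    "formatting": ["A", "A", "B"],        # extension: designers' style guide applies
    "factual_accuracy": ["B", "B", "B"],  # evidence: there is a fact of the matter
    "value_preference": ["A", "B", "B"],  # authority: the population's considered view
}

result = {
    "formatting": aggregate_extension(annotations["formatting"], designer_gold="A"),
    "factual_accuracy": aggregate_evidence(annotations["factual_accuracy"]),
    "value_preference": aggregate_authority(annotations["value_preference"],
                                            weights=[0.6, 0.2, 0.2]),
}
print(result)  # {'formatting': 'A', 'factual_accuracy': 'B', 'value_preference': 'A'}
```

The point of the sketch is only that the three models motivate different validation and aggregation machinery; a real pipeline would also differ in how it solicits annotations and audits the resulting labels.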
Why it matters
This paper makes explicit the often-implicit normative role of human judgments in RLHF, which matters for how alignment pipelines are designed and justified. By distinguishing the three models, it helps pipeline designers avoid conflating them and tailor solicitation, validation, and aggregation to each dimension of the annotation task, supporting more robust, ethical, and transparent large language model development.
Original Abstract
Preference-based alignment methods, most prominently Reinforcement Learning with Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers' own judgments about what outputs should be. The second is evidence: annotators provide independent evidence about some facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority (as representatives of the broader population) to determine system outputs. I argue that these models have implications for how RLHF pipelines should solicit, validate and aggregate annotations. I survey landmark papers in the literature on RLHF and related methods to illustrate how they implicitly draw on these models, describe failure modes that come from unintentionally or intentionally conflating them, and offer normative criteria for choosing among them. My central recommendation is that RLHF pipeline designers should decompose annotation into separable dimensions and tailor each pipeline to the model most appropriate for that dimension, rather than seeking a single unified pipeline.