Three Models of RLHF Annotation: Extension, Evidence, and Authority
TLDR
This paper introduces three conceptual models (extension, evidence, authority) for the normative role of human judgments in RLHF, detailing their implications for annotation pipelines.
Key contributions
- Distinguishes three models of the annotator's role in RLHF: extension (annotators extend the system designers' own judgments), evidence (annotators provide independent evidence about moral, social, or other facts), and authority (annotators, as representatives of the broader population, have independent authority to determine outputs).
- Shows how the choice of model shapes how an RLHF pipeline should solicit, validate, and aggregate annotations.
- Surveys landmark papers on RLHF and related methods to show how they implicitly rely on these models, and describes failure modes that arise from conflating them.
- Recommends decomposing annotation into separable dimensions and tailoring each pipeline to the model most appropriate for that dimension (see the sketch after this list).
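The last bullet is the most actionable recommendation, so here is a minimal, hypothetical Python sketch of what per-dimension tailoring could look like: each annotation dimension is routed to an aggregation rule matching the model assumed to govern it. The dimension names, labels, weights, and aggregation rules are illustrative assumptions, not prescriptions taken from the paper.

```python
from collections import Counter

# Hypothetical per-dimension pipeline: route each annotation dimension to the
# aggregation rule matching the model (extension, evidence, authority) assumed
# to govern it. Names, labels, and weights below are illustrative only.

def aggregate_extension(labels, designer_gold=None):
    """Extension: annotators stand in for the designers' own judgment, so a
    designer-provided adjudication overrides annotator disagreement."""
    if designer_gold is not None:
        return designer_gold
    return Counter(labels).most_common(1)[0][0]  # fallback: simple majority

def aggregate_evidence(labels):
    """Evidence: each annotator is a noisy measurement of some fact, so pool
    the measurements (here, a simple majority vote)."""
    return Counter(labels).most_common(1)[0][0]

def aggregate_authority(labels, weights):
    """Authority: annotators represent a broader population, so weight their
    votes to match that population (e.g., demographic reweighting)."""
    tally = Counter()
    for label, weight in zip(labels, weights):
        tally[label] += weight
    return tally.most_common(1)[0][0]

# One pairwise comparison, decomposed into separable dimensions.
annotations = {
    "formatting": ["A", "A", "B"],        # extension: designers' style guide applies
    "factual_accuracy": ["B", "B", "B"],  # evidence: there is a fact of the matter
    "value_preference": ["A", "B", "B"],  # authority: the population's considered view
}

result = {
    "formatting": aggregate_extension(annotations["formatting"], designer_gold="A"),
    "factual_accuracy": aggregate_evidence(annotations["factual_accuracy"]),
    "value_preference": aggregate_authority(annotations["value_preference"],
                                            weights=[0.6, 0.2, 0.2]),
}
print(result)  # {'formatting': 'A', 'factual_accuracy': 'B', 'value_preference': 'A'}
```

The point of the sketch is only that the three models motivate different validation and aggregation machinery; a real pipeline would also differ in how it solicits annotations and audits the resulting labels.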
Why it matters
This paper makes explicit the often-implicit normative role of human judgments in RLHF, which matters for how alignment pipelines are designed and justified. By distinguishing the three models, it helps pipeline designers avoid conflating them and tailor solicitation, validation, and aggregation to each dimension of the annotation task, supporting more robust, ethical, and transparent large language model development.
Original Abstract
Preference-based alignment methods, most prominently Reinforcement Learning with Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers' own judgments about what outputs should be. The second is evidence: annotators provide independent evidence about some facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority (as representatives of the broader population) to determine system outputs. I argue that these models have implications for how RLHF pipelines should solicit, validate and aggregate annotations. I survey landmark papers in the literature on RLHF and related methods to illustrate how they implicitly draw on these models, describe failure modes that come from unintentionally or intentionally conflating them, and offer normative criteria for choosing among them. My central recommendation is that RLHF pipeline designers should decompose annotation into separable dimensions and tailor each pipeline to the model most appropriate for that dimension, rather than seeking a single unified pipeline.