Many-Tier Instruction Hierarchy in LLM Agents
Jingyu Zhang, Tianjian Li, William Jurayj, Hongyuan Zhan, Benjamin Van Durme + 1 more
TLDR
This paper introduces ManyIH, a new paradigm and benchmark (ManyIH-Bench) to help LLM agents resolve conflicts from many instruction sources.
Key contributions
- Proposes Many-Tier Instruction Hierarchy (ManyIH) for resolving conflicts across many instruction privilege levels.
- Introduces ManyIH-Bench, the first benchmark for evaluating many-tier instruction conflict resolution.
- ManyIH-Bench includes 853 agentic tasks with up to 12 privilege levels, spanning 46 real-world agents.
- Finds frontier LLMs achieve only ~40% accuracy on ManyIH-Bench, showing a critical performance gap.
Why it matters
Current instruction hierarchy paradigms are insufficient for real-world LLM agents facing many conflicting instructions. This work provides a crucial framework and benchmark to address this gap, highlighting an urgent need for improved conflict resolution methods to ensure agent safety and effectiveness.
Original Abstract
Large language model agents receive instructions from many sources-system messages, user prompts, tool outputs, and more-each carrying different levels of trust and authority. When these instructions conflict, models must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving instruction conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction-following). ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even the current frontier models perform poorly (~40% accuracy) when instruction conflict scales. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.
📬 Weekly AI Paper Digest
Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.