PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models
Han Bao, Penghao Zhang, Yue Huang, Zhengqing Yuan, Yanchi Ru, et al.
TLDR
PolicyLLM introduces PolicyBench, a cross-system benchmark, and PolicyMoE, an MoE model, to evaluate and enhance LLM comprehension of public policy.
Key contributions
- Introduces PolicyBench, the first large-scale cross-system benchmark (US-China) for public policy comprehension.
- PolicyBench evaluates LLMs across 21K cases, assessing Memorization, Understanding, and Application via Bloom's taxonomy.
- Proposes PolicyMoE, a domain-specialized Mixture-of-Experts model with modules for each cognitive level.
- PolicyMoE demonstrates stronger performance on application-oriented policy tasks and structured reasoning.
Why it matters
LLMs are increasingly used in public policy, making their reliable comprehension critical. This paper provides the first comprehensive benchmark and a specialized model to address this gap. It highlights current LLM limitations and paves the way for more robust policy-focused AI.
Original Abstract
Large Language Models (LLMs) are increasingly integrated into real-world decision-making, including in the domain of public policy. Yet, their ability to comprehend and reason about policy-related content remains underexplored. To fill this gap, we present ***PolicyBench***, the first large-scale cross-system benchmark (US-China) evaluating policy comprehension, comprising 21K cases across a broad spectrum of policy areas and capturing the diversity and complexity of real-world governance. Following Bloom's taxonomy, the benchmark assesses three core capabilities: (1) **Memorization**: factual recall of policy knowledge, (2) **Understanding**: conceptual and contextual reasoning, and (3) **Application**: problem-solving in real-life policy scenarios. Building on this benchmark, we further propose ***PolicyMoE***, a domain-specialized Mixture-of-Experts (MoE) model with expert modules aligned to each cognitive level. The proposed model demonstrates stronger performance on application-oriented policy tasks than on memorization or conceptual understanding, and yields the highest accuracy on structured reasoning tasks. Our results reveal key limitations of current LLMs in policy understanding and suggest paths toward more reliable, policy-focused models.
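The abstract describes PolicyMoE as a Mixture-of-Experts model whose expert modules are aligned to Bloom's three cognitive levels. The paper's actual implementation is not given here; as a rough illustration of the general MoE routing pattern such a design relies on, the following is a minimal top-k gating sketch in NumPy. All names, shapes, and the three-expert layout are hypothetical assumptions for illustration, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 16
# Hypothetical: one expert per Bloom level, as the abstract suggests.
EXPERTS = ["memorization", "understanding", "application"]

# Hypothetical parameters: a gating matrix and one small FFN per expert.
W_gate = rng.normal(size=(HIDDEN, len(EXPERTS)))
W_experts = rng.normal(size=(len(EXPERTS), HIDDEN, HIDDEN))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(h, top_k=1):
    """Route a hidden state h to its top-k experts and mix their outputs."""
    gate = softmax(h @ W_gate)               # routing probability per expert
    top = np.argsort(gate)[-top_k:]          # indices of the k most-probable experts
    weights = gate[top] / gate[top].sum()    # renormalize over selected experts
    out = sum(w * np.tanh(h @ W_experts[i]) for i, w in zip(top, weights))
    return out, {EXPERTS[i]: float(gate[i]) for i in top}

h = rng.normal(size=HIDDEN)
y, routing = moe_forward(h, top_k=1)         # only one expert fires per input
```

With `top_k=1` each input activates a single expert, which is how sparse MoE models keep per-token compute low while scaling total parameters; a domain-specialized variant would additionally train each expert on data matching its cognitive level.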