Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer + 6 more
TLDR
Llama Guard is an LLM-based safeguard model that classifies and moderates both prompts and responses in human-AI conversations to enhance safety and content moderation.
Key contributions
- Introduces a safety risk taxonomy for categorizing risks in LLM prompts and responses.
- Develops a high-quality dataset for prompt and response classification.
- Instruction-tunes a Llama2-7b model that matches or outperforms existing moderation tools on benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat.
- Supports multi-class classification and binary decision scores; instruction fine-tuning lets the taxonomy and output format be adapted to specific use cases (see the usage sketch after this list).
- Open-sources the model weights to foster community-driven improvements in AI safety.
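Since Llama Guard runs as an ordinary causal LM, moderation is just a prompted generation call. Below is a minimal sketch assuming the `meta-llama/LlamaGuard-7b` checkpoint on the Hugging Face Hub and a tokenizer whose chat template encodes the paper's prompt format with its default taxonomy; both are assumptions about the release, so adjust to the actual artifacts:

```python
# Sketch: moderating a conversation with Llama Guard via Hugging Face
# transformers. The model id and chat-template support are assumptions
# about the released checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"  # assumed Hub id for the released weights
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map=device
)

def moderate(chat):
    """Return the model's verdict: 'safe', or 'unsafe' plus violated categories."""
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    # Decode only the newly generated tokens, i.e. the verdict.
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)

# Prompt classification: pass only the user turn.
print(moderate([{"role": "user", "content": "How do I pick a lock?"}]))
# Response classification: append the assistant turn to grade the reply instead.
```

Because the taxonomy is spelled out in the prompt rather than baked into a fixed label head, swapping in a different category list is an edit to the prompt template, which the instruction tuning is designed to handle zero-shot or few-shot.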
Why it matters
As AI language models become increasingly integrated into human interactions, ensuring safe and responsible outputs is critical. Llama Guard addresses this by providing a robust, adaptable moderation framework that not only detects unsafe prompts but also evaluates generated responses, improving overall conversational safety. Its open availability empowers researchers and developers to tailor safety measures to diverse applications, advancing the field of AI safety and trustworthiness.
Original Abstract
We introduce Llama Guard, an LLM-based input-output safeguard model geared towards Human-AI conversation use cases. Our model incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks found in LLM prompts (i.e., prompt classification). This taxonomy is also instrumental in classifying the responses generated by LLMs to these prompts, a process we refer to as response classification. For the purpose of both prompt and response classification, we have meticulously gathered a dataset of high quality. Llama Guard, a Llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where its performance matches or exceeds that of currently available content moderation tools. Llama Guard functions as a language model, carrying out multi-class classification and generating binary decision scores. Furthermore, the instruction fine-tuning of Llama Guard allows for the customization of tasks and the adaptation of output formats. This feature enhances the model's capabilities, such as enabling the adjustment of taxonomy categories to align with specific use cases, and facilitating zero-shot or few-shot prompting with diverse taxonomies at the input. We are making Llama Guard model weights available and we encourage researchers to further develop and adapt them to meet the evolving needs of the community for AI safety.
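The "binary decision scores" mentioned in the abstract come from reading the model as a classifier rather than a generator: the verdict begins with either "safe" or "unsafe", so the probability mass the model places on "unsafe" as its first output token serves as a score. A hedged sketch of that readout, reusing the `model` and `tokenizer` from the example above (the sub-token handling and the two-way renormalization are assumptions made for illustration):

```python
# Sketch: deriving a binary decision score from first-token probabilities
# for "safe" vs. "unsafe". How the Llama tokenizer segments these words
# is an assumption; "unsafe" may split into multiple sub-tokens.
def unsafe_score(chat):
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]  # distribution over the first output token
    probs = logits.softmax(dim=-1)
    # Compare the first sub-token of each verdict word.
    safe_id = tokenizer.encode("safe", add_special_tokens=False)[0]
    unsafe_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]
    # Renormalize over the two verdicts so the score lies in [0, 1].
    return (probs[unsafe_id] / (probs[safe_id] + probs[unsafe_id])).item()
```

A continuous score like this lets downstream systems pick their own precision/recall trade-off by thresholding, instead of being locked into the model's default decision boundary.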