ArXiv TLDR

PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

arXiv:2605.05682

Wesley Hanwen Deng, Mingxi Yan, Sunnie S. Y. Kim, Akshita Jha, Lauren Wilcox + 3 more

cs.HC, cs.AI, cs.CY

TLDR

PersonaTeaming introduces persona-driven red-teaming, enhancing both automated and human-AI collaborative methods for identifying generative AI risks.

Key contributions

  • Developed PersonaTeaming Workflow, an automated method using personas for adversarial prompt generation.
  • PersonaTeaming Workflow achieves higher attack success rates than the state-of-the-art RainbowPlus while maintaining prompt diversity.
  • Introduced PersonaTeaming Playground, a human-AI interface for red-teamers to author personas and refine prompts.
  • A user study with 11 industry practitioners shows the Playground enables diverse strategies, produces outputs practitioners find useful, and encourages out-of-the-box thinking.

Why it matters

This paper advances AI safety by integrating human perspectives and identities into red-teaming. It offers both automated and human-in-the-loop tools, making risk discovery more comprehensive and effective, and its insights on human-AI collaboration can inform the design of future safety tools.

Original Abstract

Recent developments in AI safety research have called for red-teaming methods that effectively surface potential risks posed by generative AI models, with growing emphasis on how red-teamers' backgrounds and perspectives shape their strategies and the risks they uncover. While automated red-teaming approaches promise to complement human red-teaming through larger-scale exploration, existing automated approaches do not account for human identities and rarely incorporate human inputs. In this work, we explore persona-driven red-teaming to advance both automated red-teaming and human-AI collaboration. We first develop PersonaTeaming Workflow, which incorporates personas into the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. Compared to RainbowPlus, a state-of-the-art automated red-teaming method, PersonaTeaming Workflow achieves higher attack success rates while maintaining prompt diversity. However, since automated personas only approximate real human perspectives, we further instantiate PersonaTeaming Workflow as PersonaTeaming Playground, a user-facing interface that enables red-teamers to author their own personas and collaborate with AI to mutate and refine prompts. In a user study with 11 industry practitioners, we found that PersonaTeaming Playground enabled diverse red-teaming strategies and outputs that practitioners perceived as useful, and that AI-generated suggestions in the PersonaTeaming Playground encouraged out-of-the-box thinking even when practitioners did not follow them strictly. Together, our work advances both automated and human-in-the-loop approaches to red-teaming, while shedding light on interaction patterns and design insights for supporting human-AI collaboration in generative AI red-teaming.
