Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation
Cristian Morasso, Anisa Halimi, Muhammad Zaid Hameed, Douglas Leith
TLDR
PCAP conditions automated LLM red-teaming on diverse attacker personas, raising attack success on GPT-OSS 120B from 57% to 97% and producing defense data that improves robustness after fine-tuning.
Key contributions
- Introduces Persona-Conditioned Adversarial Prompting (PCAP) for diverse LLM red-teaming.
- PCAP conditions adversarial search on varied attacker personas and strategies.
- Increases attack success on GPT-OSS 120B from 57% to 97% and produces 2-6x more diverse prompts.
- Fine-tuning lightweight adapters on PCAP-generated data improves model robustness (recall: 0.36 -> 0.99, F1: 0.53 -> 0.96).
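As a rough sanity check on the reported metrics, the recall and F1 figures from the abstract can be combined to back out the precision they imply. This is a sketch using only the standard F1 definition; the helper names `f1_score` and `implied_precision` are illustrative, and because the paper's figures are rounded to two decimals the results are approximate.

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 and recall are inconsistent (need F1 < 2 * recall)")
    return f1 * recall / denom


# Reported figures (rounded in the source, so treat results as approximate).
before = implied_precision(f1=0.53, recall=0.36)
after = implied_precision(f1=0.96, recall=0.99)
print(f"implied precision before: {before:.2f}, after: {after:.2f}")
```

Under these assumptions the implied precision is about 1.0 before fine-tuning (the rounded inputs push it marginally above 1) and about 0.93 afterwards, which is consistent with the abstract's description of a large recall gain at the cost of few false positives.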
Why it matters
This paper addresses the limitations of narrow automated red-teaming by introducing a persona-driven approach. It demonstrates a practical, closed-loop system for discovering diverse LLM vulnerabilities and effectively mitigating them.
Original Abstract
Automated red-teaming for LLMs often discovers narrow attack slices, missing diverse real-world threats, and yielding insufficient data for safety fine-tuning. We introduce Persona-Conditioned Adversarial Prompting (PCAP), which conditions adversarial search on diverse attacker personas (e.g., doctors, students, malicious actors) and strategy sets to explore realistic attack scenarios. By running parallel persona-conditioned searches, PCAP discovers transferable jailbreaks across different contexts and generates rich defense datasets with automatic metadata tracking. On GPT-OSS 120B, PCAP increases attack success from 57% to 97% while producing 2-6x more diverse prompts covering varied real-world scenarios. Critically, fine-tuning lightweight adapters on PCAP-generated data significantly improves model robustness (recall: 0.36 -> 0.99, F1: 0.53 -> 0.96) with minimal false positives, demonstrating a practical closed-loop approach from vulnerability discovery to automated alignment.
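The idea of running parallel persona-conditioned searches with metadata tracking can be sketched structurally as follows. This is not the authors' implementation: `Condition`, `build_conditions`, `run_search`, and `run_parallel` are hypothetical names, the persona and strategy labels are placeholders, and `run_search` is a stub standing in for whatever iterative attacker/target loop the paper actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Condition:
    """One search configuration: an attacker persona paired with a strategy."""
    persona: str
    strategy: str


def build_conditions(personas, strategies):
    """Cross every persona with every strategy."""
    return [Condition(p, s) for p, s in product(personas, strategies)]


def run_search(condition, seed_request):
    """Stub for a single conditioned search (assumption: the real method
    iterates against a target model; here we only tag the seed request)."""
    prompt = (
        f"[persona={condition.persona}] "
        f"[strategy={condition.strategy}] {seed_request}"
    )
    # Metadata travels with every record so results can be sliced later.
    return {
        "persona": condition.persona,
        "strategy": condition.strategy,
        "prompt": prompt,
    }


def run_parallel(personas, strategies, seed_request, max_workers=4):
    """Run one search per condition concurrently; output order follows input order."""
    conditions = build_conditions(personas, strategies)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda c: run_search(c, seed_request), conditions))


records = run_parallel(
    personas=["doctor", "student"],
    strategies=["role_play", "hypothetical"],
    seed_request="<placeholder request>",
)
```

Keeping the persona and strategy on each record is what would let a downstream step group results by condition or assemble a labeled fine-tuning set, in the spirit of the "automatic metadata tracking" the abstract mentions.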