Jerry Wei

2 papers · Latest: April 30, 2026

Jailbroken Frontier Models Retain Their Capabilities

Advanced jailbreaks cause minimal capability degradation in frontier models, challenging assumptions about the safety of those models.

A new streaming probing method for LLMs uses segment-level coherence to robustly detect harmful intent, especially in the chemical, biological, radiological, and nuclear (CBRN) domain, while reducing false alarms.
