Jerry Wei
2 papers · Latest:
Machine Learning
Jailbroken Frontier Models Retain Their Capabilities
Advanced jailbreaks impose minimal capability degradation on frontier models, challenging assumptions about their safety.
2605.00267
Natural Language Processing
Segment-Level Coherence for Robust Harmful Intent Probing in LLMs
A new streaming probing method for LLMs uses segment-level coherence to robustly detect harmful intent, especially in CBRN, reducing false alarms.
2604.14865