Chao Yang
2 papers · Latest:
Cryptography & Security
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
SkillSafetyBench evaluates how reusable skills in LLM agents create new attack surfaces, revealing vulnerabilities beyond model-level alignment.
2605.12015
Machine Learning
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization
MODPO is a novel, RL-free method for aligning language models to multiple human preferences simultaneously, achieving stable and efficient optimization across diverse objectives.
2310.03708