Xuezhi Cao
3 papers
Software Engineering
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
SWE-Cycle introduces a new benchmark and SWE-Judge evaluation system to accurately assess autonomous code agents across the complete software issue resolution cycle.
2605.13139
Natural Language Processing
General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks
General365 is a new benchmark assessing LLMs' general reasoning, revealing their domain-dependent abilities and significant room for improvement beyond specialized tasks.
2604.11778
Computer Vision
LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment
LARY introduces a benchmark and dataset for evaluating latent action representations, showing that general-purpose visual models perform best and that latent spaces align more closely with physical actions.
2604.11689