Xuezhi Cao
3 papers
Software Engineering
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
SWE-Cycle introduces a new benchmark and SWE-Judge evaluation system to accurately assess autonomous code agents across the complete software issue resolution cycle.
2605.13139
Natural Language Processing
General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks
General365 is a new benchmark assessing LLMs' general reasoning, revealing their domain-dependent abilities and significant room for improvement beyond specialized tasks.
2604.11778
Computer Vision
LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment
LARY introduces a benchmark and dataset for evaluating latent action representations, showing that general-purpose visual models perform best and that latent spaces align more closely with physical actions.
2604.11689