Xiaomin Li
2 papers · Latest:
Software Engineering
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
AgentLens reveals the 'Lucky Pass' problem in SWE-agent evaluation, introducing a process-level framework to assess trajectory quality beyond simple pass/fail.
2605.12925
Computer Vision
Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios
DailyClue is a new benchmark for MLLMs that evaluates their ability to perform visual clue-driven reasoning in complex, real-world daily scenarios.
2604.14041