ArXiv TLDR

Yifan Yang

8 papers · Latest:

Computer Vision

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

This paper introduces CUActSpot, a new benchmark and data synthesis method to improve computer-use agents' reliability on complex, diverse interactions.

2605.12501
Natural Language Processing

MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering

MedHopQA is a new disease-centered multi-hop reasoning benchmark for evaluating LLMs in biomedical QA, designed to resist saturation and contamination.

2605.12361
Robotics

MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

MemCompiler dynamically compiles state-conditioned memory for embodied agents, improving performance and efficiency over static memory injection.

2605.07594
Artificial Intelligence

SOAR: Real-Time Joint Optimization of Order Allocation and Robot Scheduling in Robotic Mobile Fulfillment Systems

SOAR uses Deep Reinforcement Learning for real-time joint optimization of order allocation and robot scheduling in robotic mobile fulfillment systems.

2605.03842
Computer Vision

Toward Multimodal Conversational AI for Age-Related Macular Degeneration

OcularChat, a new multimodal LLM, accurately diagnoses age-related macular degeneration (AMD) from fundus photos with clinical reasoning and interactive dialogue.

2604.25720
Computer Vision

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

World-R1 uses reinforcement learning to enforce 3D constraints in text-to-video generation, improving geometric consistency without architectural changes.

2604.24764
Computer Vision

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

MM-WebAgent is a hierarchical multimodal agent that generates coherent and visually consistent webpages by coordinating AIGC elements through planning and self-reflection.

2604.15309
Computer Vision

AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

AVGen-Bench introduces a new benchmark and multi-granular evaluation for Text-to-Audio-Video generation, revealing gaps in semantic reliability.

2604.08540
