Yifan Yang
8 papers · Latest:
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
This paper introduces CUActSpot, a new benchmark and data synthesis method to improve computer-use agents' reliability on complex, diverse interactions.
MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering
MedHopQA is a new disease-centered multi-hop reasoning benchmark for evaluating LLMs in biomedical QA, designed to resist saturation and contamination.
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
MemCompiler dynamically compiles state-conditioned memory for embodied agents, improving performance and efficiency over static memory injection.
SOAR: Real-Time Joint Optimization of Order Allocation and Robot Scheduling in Robotic Mobile Fulfillment Systems
SOAR uses Deep Reinforcement Learning for real-time joint optimization of order allocation and robot scheduling in robotic mobile fulfillment systems.
Toward Multimodal Conversational AI for Age-Related Macular Degeneration
OcularChat, a new multimodal LLM, accurately diagnoses age-related macular degeneration (AMD) from fundus photos with clinical reasoning and interactive dialogue.
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
World-R1 uses reinforcement learning to enforce 3D constraints in text-to-video generation, improving geometric consistency without architectural changes.
MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
MM-WebAgent is a hierarchical multimodal agent that generates coherent and visually consistent webpages by coordinating AIGC elements through planning and self-reflection.
AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
AVGen-Bench introduces a new benchmark and multi-granular evaluation for Text-to-Audio-Video generation, revealing gaps in semantic reliability.