ArXiv TLDR

Computer Vision

Papers on image recognition, object detection, video analysis, and visual understanding.

cs.CV · 703 papers

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

GAP proposes a granular alignment paradigm that stabilizes visual latent reasoning in MLLMs by addressing feature-space mismatches, which improves performance.

2605.12374 · May 12, 2026 · Yanting Miao, Yutao Sun, Dexin Wang +8

VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference

VIP enhances dino.txt for efficient, high-quality open-vocabulary semantic segmentation by evolving text prompts with visual guidance.

2605.12325 · May 12, 2026 · Hao Zhu, Shuo Jin, Wenbin Liao +4

From Model Uncertainty to Human Attention: Localization-Aware Visual Cues for Scalable Annotation Review

This paper introduces visual cues for spatial uncertainty in AI-assisted annotation, improving label quality and speed by guiding human attention.

2605.12303 · May 12, 2026 · Moussa Kassem Sbeyti, Joshua Holstein, Philipp Spitzer +2

EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras

EgoEV-HandPose uses stereo event cameras and a new dataset for robust egocentric 3D hand pose estimation and gesture recognition, outperforming RGB.

2605.12297 · May 12, 2026 · Luming Wang, Hao Shi, Jiajun Zhai +2

TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion

TriBand-BEV introduces a real-time LiDAR-only 3D pedestrian detection method using a height-aware BEV encoding, outperforming prior methods on KITTI.

2605.12220 · May 12, 2026 · Mohammad Khoshkdahan, Alexey Vinel

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

MoLA transforms imagined robot manipulation videos into executable actions by inferring a mixture of latent actions via inverse dynamics models.

2605.12167 · May 12, 2026 · Yajie Li, Bozhou Zhang, Chun Gu +5

Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

This paper introduces Uni-AdGen, a unified autoregressive model for personalized image and text ad generation, improving realism and user preference.

2605.12138 · May 12, 2026 · Yexing Xu, Wei Feng, Shen Zhang +15

World Action Models: The Next Frontier in Embodied AI

This survey introduces World Action Models (WAMs), a new embodied AI paradigm unifying predictive state modeling with action generation, providing a systematic overview.

2605.12090 · May 12, 2026 · Siyin Wang, Junhao Shi, Zhaoyang Fu +11

Very Efficient Listwise Multimodal Reranking for Long Documents

ZipRerank is a highly efficient listwise multimodal reranker that significantly speeds up M-RAG for long documents by reducing input length and eliminating autoregressive decoding.

2605.11864 · May 12, 2026 · Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh

One-Step Generative Modeling via Wasserstein Gradient Flows

W-Flow introduces a novel one-step generative model using Wasserstein gradient flows, achieving state-of-the-art image generation 100x faster than diffusion models.

2605.11755 · May 12, 2026 · Jiaqi Han, Puheng Li, Qiushan Guo +3

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

SLAS improves text-to-image models with super-linear advantage shaping, which mitigates reward hacking and makes training more efficient and robust.

2605.10937 · May 11, 2026 · Haoyuan Sun, Jing Wang, Yuxin Song +9

Personal Visual Context Learning in Large Multimodal Models

This paper defines Personal VCL for LMMs, presents a benchmark, and proposes the Agentic Context Bank to enable personalized visual reasoning.

2605.10936 · May 11, 2026 · Zihui Xue, Ami Baid, Sangho Kim +2

Variational Inference for Lévy Process-Driven SDEs via Neural Tilting

This paper introduces a neural exponential tilting framework for variational inference in Lévy-driven SDEs, addressing challenges in modeling extreme events.

2605.10934 · May 11, 2026 · Yaman Kindap, Manfred Opper, Benjamin Dupuis +2

Pixal3D: Pixel-Aligned 3D Generation from Images

Pixal3D introduces a pixel-aligned 3D generation method that significantly improves fidelity for creating high-quality 3D assets from images.

2605.10922 · May 11, 2026 · Dong-Yang Li, Wang Zhao, Yuxin Chen +5

Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition

A new confidence-guided diffusion augmentation framework significantly boosts Bangla compound character recognition by synthesizing and filtering high-quality data.

2605.10916 · May 11, 2026 · Md. Sultan Al Rayhan, Maheen Islam

CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

CapVector learns transferable capability vectors in parametric space for VLA models, enhancing performance and reducing adaptation costs during finetuning.

2605.10903 · May 11, 2026 · Wenxuan Song, Han Zhao, Fuhao Li +7

Count Anything at Any Granularity

This paper introduces multi-grained open-world object counting, a new dataset (KubriCount), and a model (HieraCount) to improve counting accuracy.

2605.10887 · May 11, 2026 · Chang Liu, Haoning Wu, Weidi Xie

Geometry-aware Prototype Learning for Cross-domain Few-shot Medical Image Segmentation

GeoProto is a geometry-aware framework for cross-domain few-shot medical image segmentation, improving generalization by leveraging structural priors.

2605.10885 · May 11, 2026 · Feifan Song, Yuntian Bo, Haofeng Zhang

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

CADBench is a new unified multimodal benchmark for evaluating AI models in generating editable CAD programs from various inputs.

2605.10873 · May 11, 2026 · Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme +3

BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data

BEACON is a large, multimodal dataset from competitive Valorant gameplay for continuous authentication and behavioral fingerprinting research.

2605.10867 · May 11, 2026 · Ishpuneet Singh, Gursmeep Kaur, Uday Pratap Singh Atwal +3
