ArXiv TLDR

Computer Vision

Papers on image recognition, object detection, video analysis, and visual understanding.

cs.CV · 703 papers

ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation

ActCam enables zero-shot joint 3D motion and camera control for video generation, improving fidelity and camera adherence with staged guidance.

2605.06667 · May 7, 2026 · Omar El Khalifi, Thomas Rossi, Oscar Fossey +6

BAMI: Training-Free Bias Mitigation in GUI Grounding

BAMI is a training-free method that uses coarse-to-fine focus and candidate selection to mitigate precision and ambiguity biases in GUI grounding models.

2605.06664 · May 7, 2026 · Borui Zhang, Bo Zhang, Bo Wang +6

Relit-LiVE: Relight Video by Jointly Learning Environment Video

Relit-LiVE relights videos consistently and stably without requiring camera pose information, using raw images and jointly predicting environment videos.

2605.06658 · May 7, 2026 · Weiqing Xiao, Hong Li, Xiuyu Yang +7

Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

A new benchmark, MMDG-Bench, reveals that recent multimodal domain generalization methods offer only marginal gains over baselines and struggle with real-world challenges.

2605.06643 · May 7, 2026 · Hao Dong, Hongzhao Li, Shupan Li +3

GlazyBench: A Benchmark for Ceramic Glaze Property Prediction and Image Generation

GlazyBench introduces the first large-scale dataset (23,148 glazes) for AI-assisted ceramic glaze design, enabling property prediction and image generation.

2605.06641 · May 7, 2026 · Ziyu Zhai, Siyou Li, Juexi Shao +1

DPM++: Dynamic Masked Metric Learning for Occluded Person Re-identification

DPM++ introduces dynamic masked metric learning with CLIP-based supervision and saliency-guided data augmentation for robust occluded person re-identification.

2605.06637 · May 7, 2026 · Lei Tan, Yingshi Luan, Pincong Zou +2

SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders

SoftSAE introduces a dynamic Top-K selection mechanism for sparse autoencoders, adapting feature sparsity to input complexity for better interpretability.

2605.06610 · May 7, 2026 · Jakub Stępień, Marcin Mazur, Jacek Tabor +1

DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency

DINORANKCLIP enhances vision-language pretraining by integrating a DINOv3 teacher for local structure and a novel high-order ranking consistency loss.

2605.06592 · May 7, 2026 · Shuyang Jiang, Nan Yu, Yiming Zhang +2

Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation

This paper introduces an FFT-based interpolation method to solve minimal problems without matrix inversion, offering a stable and fast alternative.

2605.06572 · May 7, 2026 · Haidong Wu, Snehal Bhayani, Janne Heikkilä

Continuous Latent Diffusion Language Model

Cola DLM is a hierarchical latent diffusion language model that generates text by modeling global semantics in a continuous latent space, offering a flexible non-autoregressive approach.

2605.06548 · May 7, 2026 · Hongcan Guo, Qinyu Zhao, Yian Zhao +8

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

MedHorizon introduces a new benchmark for long-context medical video understanding, revealing that current MLLMs struggle with sparse evidence retrieval and clinical reasoning.

2605.06537 · May 7, 2026 · Bodong Du, Bowen Liu, Yang Yu +8

Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

Sparkle introduces a new dataset and benchmark for high-quality video background replacement, significantly improving model performance.

2605.06535 · May 7, 2026 · Ziyun Zeng, Yiqi Lin, Guoqiang Liang +1

Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models

This paper argues that agentic AIs are the missing paradigm for robust out-of-distribution generalization in foundation models, overcoming limitations of model-centric approaches.

2605.06522 · May 7, 2026 · Xin Wang, Haibo Chen, Wenxuan Liu +1

DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

DCR is a training-free method that uses counterfactual attractor guidance to prevent diffusion models from collapsing on rare compositional prompts.

2605.06512 · May 7, 2026 · Taewon Kang, Matthias Zwicker

FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction

FreeSpec introduces a training-free method for long video generation, leveraging singular-spectrum reconstruction to overcome temporal inconsistencies.

2605.06509 · May 7, 2026 · Fangda Chen, Shanshan Zhao, Longrong Yang +3

MARBLE: Multi-Aspect Reward Balance for Diffusion RL

MARBLE introduces a gradient-space optimization framework to balance multiple rewards for diffusion RL, improving all dimensions simultaneously without manual weighting.

2605.06507 · May 7, 2026 · Canyu Zhao, Hao Chen, Yunze Tong +3

3D MRI Image Pretraining via Controllable 2D Slice Navigation Task

This paper introduces a novel self-supervised pretraining method for 3D MRI by converting volumes into controllable 2D slice navigation sequences.

2605.06487 · May 7, 2026 · Yu Wang, Qingchao Chen

GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs

GeoStack is a modular framework for VLMs that composes independently trained experts, mitigating catastrophic forgetting with constant-time inference.

2605.06477 · May 7, 2026 · Pranav Mantini, Shishir K. Shah

Hyperbolic Concept Bottleneck Models

Hyperbolic Concept Bottleneck Models (HypCBM) improve interpretability by embedding concepts in hyperbolic space, leveraging semantic hierarchies.

2605.06440 · May 7, 2026 · Daniel Uyterlinde, Swasti Shreya Mishra, Pascal Mettes

From Review to Design: Ethical Multimodal Driver Monitoring Systems for Risk Mitigation, Incident Response, and Accountability in Automated Vehicles

This paper proposes an ethical design framework for multimodal Driver Monitoring Systems (DMS) in automated vehicles, addressing privacy, fairness, and accountability.

2605.06439 · May 7, 2026 · Bilal Khana, Waseem Shariff, Rory Coyne +2