ArXiv TLDR

Computer Vision

Papers on image recognition, object detection, video analysis, and visual understanding.

cs.CV · 703 papers

6D Pose Estimation via Keypoint Heatmap Regression with RGB-D Residual Neural Networks

This paper proposes a 6D pose estimation framework using keypoint heatmap regression, achieving high accuracy with RGB-D fusion.

2605.08059 · May 8, 2026 · Ismail Aljosevic, Amir Masoud Almasi, Ana Parovic +1

Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization

This paper introduces a retrieval-guided diffusion noise optimization method to generate human motion under highly challenging spatiotemporal constraints.

2605.08054 · May 8, 2026 · Hanchao Liu, Fang-Lue Zhang, Shining Zhang +2

MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation

MoCoTalk is a multi-conditional diffusion framework that unifies four control signals for state-of-the-art, controllable talking head generation.

2605.08050 · May 8, 2026 · Xinyan Ye, Jiankang Deng, Abbas Edalat

SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

SCOPE is a framework that uses structured decomposition and conditional skill orchestration to maintain semantic commitments for complex text-to-image generation.

2605.08043 · May 8, 2026 · Tianfei Ren, Zhipeng Yan, Yiming Zhao +13

Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models

HFRU is a reinforcement unlearning framework for VLMs that deeply removes sensitive visual knowledge from the vision encoder, preventing object hallucination.

2605.08031 · May 8, 2026 · Kaidi Jia, Yujie Lin, Chengyi Yang +2

PET-Adapter: Test-Time Domain Adaptation for Full and Limited-Angle PET Image Reconstruction

PET-Adapter is a test-time domain adaptation framework that improves PET image reconstruction from phantom-trained models to diverse clinical data.

2605.08030 · May 8, 2026 · Rüveyda Yilmaz, Yuli Wu, Johannes Stegmaier +1

STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

STARFlow2 unifies multimodal generation by using autoregressive normalizing flows, which naturally align with LLMs, for interleaved text and image processing.

2605.08029 · May 8, 2026 · Ying Shen, Tianrong Chen, Yuan Gao +6

TRAS: An Interactive Software for Tracing Tree Ring Cross Sections

TRAS is an open-source tool that automates tree ring detection and measurement, significantly reducing manual effort in dendrochronology.

2605.08025 · May 8, 2026 · Henry Marichal, Diego Passarella, Gregory Randall

SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere

SphereVAD offers training-free video anomaly detection by leveraging pre-trained MLLM features and geometric inference on a unit hypersphere.

2605.08003 · May 8, 2026 · Chao Huang, Penfei Wei, Wei Wang +5

Rethinking Dense Optical Flow without Test-Time Scaling

This paper proposes a single-pass optical flow method leveraging foundation models to achieve strong performance without computationally expensive test-time scaling.

2605.08000 · May 8, 2026 · Praroop Chanda, Suryansh Kumar

Uncertainty Quantification for Cardiac Shape Reconstruction with Deep Signed Distance Functions via MCMC methods

This paper introduces a probabilistic framework for uncertainty-aware cardiac shape reconstruction using DeepSDFs and MCMC, providing accurate results.

2605.07987 · May 8, 2026 · Jan Verhülsdonk, Thomas Grandits, Francisco Sahli Costabal +3

Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images

Cross3R reconstructs 3D scenes and camera poses from satellite, drone, and ground images, overcoming limitations of traditional cross-view localization.

2605.07978 · May 8, 2026 · Qiwei Wang, Zhongyao Tuo, Xianghui Ze +1

HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models

HEART uses hyperspherical embeddings and Kent distributions to enable precise, training-free control over text-to-image diffusion models, preserving scene details.

2605.07973 · May 8, 2026 · Arani Roy, Shristi Das Biswas, Kaushik Roy

DVD: Discrete Voxel Diffusion for 3D Generation and Editing

DVD is a discrete diffusion framework for 3D generation and editing of sparse voxels, offering improved interpretability and direct discrete modeling.

2605.07971 · May 8, 2026 · Zhengrui Xiang, Jiaqi Wu, Fupeng Sun +2

TimeLesSeg: Unified Contrast-Agnostic Cross-Sectional and Longitudinal MS Lesion Segmentation via a Stochastic Generative Model

TimeLesSeg unifies contrast-agnostic cross-sectional and longitudinal MS lesion segmentation using a stochastic generative model.

2605.07955 · May 8, 2026 · Vicent Caselles-Ballester, Eloy Martínez-Heras, Giuseppe Pontillo +9

TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning

TAVIS is a new benchmark for active vision in imitation learning, offering task suites and metrics to evaluate gaze control in robotic manipulation.

2605.07943 · May 8, 2026 · Giacomo Spigler

Text-to-CAD Evaluation with CADTests

This paper introduces CADTestBench, the first test-based benchmark that uses CADTests to evaluate and guide Text-to-CAD model generation.

2605.07807 · May 8, 2026 · Dimitrios Mallis, Marco Wang, Ahmet Serdar Karadeniz +3

Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models

This paper demonstrates how Vision-Language Models can perform zero-shot perception of Operational Design Domain elements, enhancing safety for autonomous systems.

2605.07649 · May 8, 2026 · Berkehan Ünal, Dierend Hauke, Fazlija Dren +1

InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

InterLV-Search is a new benchmark for interleaved language-vision agentic search, revealing that current multimodal agents struggle to integrate complex visual evidence.

2605.07510 · May 8, 2026 · Bohan Hou, Jiuning Gu, Jiayan Guo +5

A Unified Measure-Theoretic View of Diffusion, Score-Based, and Flow Matching Generative Models

This paper unifies diffusion, score-based, and flow matching generative models under a measure-theoretic framework, clarifying their shared structure.

2605.06829 · May 7, 2026 · Aditya Ranganath, Mukesh Singhal
