Computer Vision
Papers on image recognition, object detection, video analysis, and visual understanding.
cs.CV · 703 papers
6D Pose Estimation via Keypoint Heatmap Regression with RGB-D Residual Neural Networks
This paper proposes a 6D pose estimation framework using keypoint heatmap regression, achieving high accuracy with RGB-D fusion.
Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization
This paper introduces a retrieval-guided diffusion noise optimization method to generate human motion under highly challenging spatiotemporal constraints.
MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation
MoCoTalk is a multi-conditional diffusion framework that unifies four control signals for state-of-the-art, controllable talking head generation.
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
SCOPE is a framework that uses structured decomposition and conditional skill orchestration to maintain semantic commitments for complex text-to-image generation.
Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models
HFRU is a reinforcement unlearning framework for VLMs that deeply removes sensitive visual knowledge from the vision encoder, preventing object hallucination.
PET-Adapter: Test-Time Domain Adaptation for Full and Limited-Angle PET Image Reconstruction
PET-Adapter is a test-time domain adaptation framework that improves PET image reconstruction from phantom-trained models to diverse clinical data.
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
STARFlow2 unifies multimodal generation by using autoregressive normalizing flows, which naturally align with LLMs, for interleaved text and image processing.
TRAS: An Interactive Software for Tracing Tree Ring Cross Sections
TRAS is open-source software that automates tree ring detection and measurement, significantly reducing manual effort in dendrochronology.
SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere
SphereVAD offers training-free video anomaly detection by leveraging pre-trained MLLM features and geometric inference on a unit hypersphere.
Rethinking Dense Optical Flow without Test-Time Scaling
This paper proposes a single-pass optical flow method leveraging foundation models to achieve strong performance without computationally expensive test-time scaling.
Uncertainty Quantification for Cardiac Shape Reconstruction with Deep Signed Distance Functions via MCMC methods
This paper introduces a probabilistic framework for uncertainty-aware cardiac shape reconstruction using DeepSDFs and MCMC, providing accurate results.
Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images
Cross3R reconstructs 3D scenes and camera poses from satellite, drone, and ground images, overcoming limitations of traditional cross-view localization.
HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models
HEART uses hyperspherical embeddings and Kent distributions to enable precise, training-free control over text-to-image diffusion models, preserving scene details.
DVD: Discrete Voxel Diffusion for 3D Generation and Editing
DVD is a discrete diffusion framework for 3D generation and editing of sparse voxels, offering improved interpretability and direct discrete modeling.
TimeLesSeg: Unified Contrast-Agnostic Cross-Sectional and Longitudinal MS Lesion Segmentation via a Stochastic Generative Model
TimeLesSeg unifies contrast-agnostic cross-sectional and longitudinal MS lesion segmentation using a stochastic generative model.
TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning
TAVIS is a new benchmark for active vision in imitation learning, offering task suites and metrics to evaluate gaze control in robotic manipulation.
Text-to-CAD Evaluation with CADTests
This paper introduces CADTestBench, the first test-based benchmark using CADTests for evaluating and guiding Text-to-CAD model generation.
Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models
This paper demonstrates how Vision-Language Models can perform zero-shot perception of Operational Design Domain elements, enhancing safety for autonomous systems.
InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search
InterLV-Search is a new benchmark for interleaved language-vision agentic search, revealing that current multimodal agents struggle with complex visual evidence integration.
A Unified Measure-Theoretic View of Diffusion, Score-Based, and Flow Matching Generative Models
This paper unifies diffusion, score-based, and flow matching generative models under a measure-theoretic framework, clarifying their shared structure.