Computer Vision
Papers on image recognition, object detection, video analysis, and visual understanding.
cs.CV · 703 papers
ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation
ActCam enables zero-shot joint 3D motion and camera control for video generation, improving fidelity and camera adherence with staged guidance.
BAMI: Training-Free Bias Mitigation in GUI Grounding
BAMI is a training-free method that uses coarse-to-fine focus and candidate selection to mitigate precision and ambiguity biases in GUI grounding models.
Relit-LiVE: Relight Video by Jointly Learning Environment Video
Relit-LiVE relights videos consistently and stably without requiring camera poses, using raw images and jointly predicting environment videos.
Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study
A new benchmark, MMDG-Bench, reveals that recent multimodal domain generalization methods offer only marginal gains over baselines and struggle with real-world challenges.
GlazyBench: A Benchmark for Ceramic Glaze Property Prediction and Image Generation
GlazyBench introduces the first large-scale dataset (23,148 glazes) for AI-assisted ceramic glaze design, enabling property prediction and image generation.
DPM++: Dynamic Masked Metric Learning for Occluded Person Re-identification
DPM++ introduces dynamic masked metric learning with CLIP-based supervision and saliency-guided data augmentation for robust occluded person re-identification.
SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders
SoftSAE introduces a dynamic Top-K selection mechanism for sparse autoencoders, adapting feature sparsity to input complexity for better interpretability.
DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency
DINORANKCLIP enhances vision-language pretraining by integrating a DINOv3 teacher for local structure and a novel high-order ranking consistency loss.
Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation
This paper introduces an FFT-based interpolation method to solve minimal problems without matrix inversion, offering a stable and fast alternative.
Continuous Latent Diffusion Language Model
Cola DLM is a hierarchical latent diffusion language model that generates text by modeling global semantics in a continuous latent space, offering a flexible non-autoregressive approach.
MedHorizon: Towards Long-context Medical Video Understanding in the Wild
MedHorizon introduces a new benchmark for long-context medical video understanding, revealing that current MLLMs struggle with sparse evidence retrieval and clinical reasoning.
Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
Sparkle introduces a new dataset and benchmark for high-quality video background replacement, significantly improving model performance.
Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models
This paper argues that agentic AIs are the missing paradigm for robust out-of-distribution generalization in foundation models, overcoming limitations of model-centric approaches.
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
DCR is a training-free method that uses counterfactual attractor guidance to prevent diffusion models from collapsing on rare compositional prompts.
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
FreeSpec introduces a training-free method for long video generation, leveraging singular-spectrum reconstruction to overcome temporal inconsistencies.
MARBLE: Multi-Aspect Reward Balance for Diffusion RL
MARBLE introduces a gradient-space optimization framework to balance multiple rewards for diffusion RL, improving all dimensions simultaneously without manual weighting.
3D MRI Image Pretraining via Controllable 2D Slice Navigation Task
This paper introduces a novel self-supervised pretraining method for 3D MRI by converting volumes into controllable 2D slice navigation sequences.
GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs
GeoStack is a modular framework for VLMs that composes independently trained experts, mitigating catastrophic forgetting with constant-time inference.
Hyperbolic Concept Bottleneck Models
Hyperbolic Concept Bottleneck Models (HypCBM) improve interpretability by embedding concepts in hyperbolic space, leveraging semantic hierarchies.
From Review to Design: Ethical Multimodal Driver Monitoring Systems for Risk Mitigation, Incident Response, and Accountability in Automated Vehicles
This paper proposes an ethical design framework for multimodal Driver Monitoring Systems (DMS) in automated vehicles, addressing privacy, fairness, and accountability.