Computer Vision
Papers on image recognition, object detection, video analysis, and visual understanding.
cs.CV · 703 papers
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh solves pose misalignment in video-guided 3D animation using a novel VAE and rectification offset for high-fidelity 4D mesh generation.
Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
This paper introduces SPA, a novel method that unlocks and aligns CLIP's patch-level features with semantic descriptions for state-of-the-art class-incremental learning.
QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling
QLAM introduces a quantum long-attention memory, extending state-space models to efficiently capture long-range dependencies using quantum superposition.
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
This paper introduces MMProLong, a new recipe for training long-context vision-language models effectively, generalizing beyond 128K context.
History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
LLMs, especially flagship models, are highly susceptible to continuing and escalating harmful actions when instructed to maintain consistency with prior unsafe history.
OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation
OmniLiDAR is a unified diffusion framework that generates 3D LiDAR scans across eight diverse domains using text conditioning, addressing single-domain limitations.
JANUS: Anatomy-Conditioned Gating for Robust CT Triage Under Distribution Shift
JANUS introduces a physiology-guided dual-stream architecture for robust CT triage, improving accuracy and reliability under distribution shifts.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
EvoGround introduces self-evolving agents for video temporal grounding, achieving state-of-the-art results without human-labeled data.
VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence
VoxCor provides training-free volumetric features from frozen 2D ViTs for robust multimodal 3D medical image voxel correspondence.
BlitzGS: City-Scale Gaussian Splatting at Lightning Speed
BlitzGS is a distributed 3DGS framework for lightning-fast city-scale reconstruction, optimizing Gaussian workload across system, model, and view levels.
Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
Realtime-VLA FLASH uses speculative inference with a lightweight draft model to significantly reduce latency in diffusion-based VLAs for real-time embodied tasks.
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
RoboEvolve co-evolves a VLM planner and VGM simulator to overcome data scarcity in robotic manipulation, achieving high efficiency with limited unlabeled data.
Generative Texture Diversification of 3D Pedestrians for Robust Autonomous Driving Perception
This paper generates diverse 3D pedestrian textures with StyleGAN2 for synthetic data, enhancing the robustness of autonomous driving perception.
Min Generalized Sliced Gromov Wasserstein: A Scalable Path to Gromov Wasserstein
min-GSGW offers a scalable, rigid-motion invariant method for Gromov-Wasserstein by using generalized slicers and an amortized variant.
Weakly-Supervised Spatiotemporal Anomaly Detection
This paper introduces a weakly-supervised spatiotemporal anomaly detection method that uses video-level labels and multiple instance ranking loss.
Aligning Network Equivariance with Data Symmetry: A Theoretical Framework and Adaptive Approach for Image Restoration
This paper introduces a theoretical framework and adaptive network for image restoration, aligning network equivariance with data symmetry to improve performance.
LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction
LEXI-SG is the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input, enabling scalable reconstruction.
Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography
An explainable AI model accurately distinguishes bicuspid aortic valve (BAV) from tricuspid aortic valve (TAV) using routine echocardiography.
Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation
CMC is a decoupled framework that generates human motions from text and trajectories, resolving conflicts and improving control accuracy.
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
AnyFlow introduces an any-step video diffusion model using flow map distillation, outperforming consistency-based methods and scaling with sampling steps.