Computer Vision
Papers on image recognition, object detection, video analysis, and visual understanding.
cs.CV · 703 papers
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
GTA-VLA is an interactive Vision-Language-Action framework that uses user-provided spatial guidance to improve robot reasoning and robustness in embodied tasks.
LoREnc: Low-Rank Encryption for Securing Foundation Models and LoRA Adapters
LoREnc is a training-free framework that secures foundation models and LoRA adapters against IP leakage and model recovery attacks with minimal overhead.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 introduces a unified architecture (NEO-unify) that seamlessly integrates multimodal understanding and generation, outperforming specialized VLMs.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
This paper introduces CUActSpot, a new benchmark and data synthesis method to improve computer-use agents' reliability on complex, diverse interactions.
EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera
EgoForce reconstructs absolute 3D hand pose from a single egocentric camera, robustly handling diverse head-mounted device configurations.
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine is a real-time autoregressive framework for generating multi-shot video narratives, enabling interactive, coherent storytelling across shot changes.
From Web to Pixels: Bringing Agentic Search into Visual Perception
This paper introduces the WebEye benchmark and the Pixel-Searcher model for visual perception tasks that require external knowledge and agentic search.
Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction
AmbiSuR improves Gaussian Splatting surface reconstruction by addressing photometric ambiguities with a novel disambiguation and self-indication module.
AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
AlphaGRPO enhances multimodal generation in UMMs using GRPO and a novel Decompositional Verifiable Reward for self-reflection and reasoning.
Elastic Attention Cores for Scalable Vision Transformers
VECA introduces elastic core-periphery attention for Vision Transformers, achieving linear-time complexity and competitive performance with learned core tokens.
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
OmniNFT proposes a novel diffusion RL framework to improve joint audio-video generation by addressing multi-modal challenges like gradient imbalance.
FuTCR: Future-Targeted Contrast and Repulsion for Continual Panoptic Segmentation
FuTCR improves continual panoptic segmentation by pre-structuring representations for future classes, boosting new-class performance.
LychSim: A Controllable and Interactive Simulation Framework for Vision Research
LychSim is an interactive, controllable simulation framework built on Unreal Engine 5, simplifying complex simulation for vision research and LLM agents.
3D Gaussian Splatting for Efficient Retrospective Dynamic Scene Novel View Synthesis with a Standardized Benchmark
This paper uses 3D Gaussian Splatting for efficient retrospective novel view synthesis of dynamic scenes in synchronized multi-view settings.
GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization
GaitProtector uses a training-free diffusion method to de-identify gait by impersonating a target identity, balancing privacy with motion quality.
AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection
AOI-SSL is a self-supervised framework for efficient semantic segmentation of wire-bonded semiconductors, reducing labeled data needs and improving adaptation.
Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
MLLMs struggle with viewpoint-dependent spatial reasoning; a new benchmark, PCSR-Bench, reveals a significant perception-reasoning gap.
GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction
GeoQuery improves sparse-view 3D reconstruction with 3D Gaussian Splatting by integrating geometry-guided diffusion and a novel cross-view attention mechanism.
SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation
SEMIR is a graph-based representation learning framework for visual segmentation that efficiently handles small, sparse structures by decoupling inference from the image grid.
Fast Image Super-Resolution via Consistency Rectified Flow
FlowSR achieves fast, high-quality single-step image super-resolution by reformulating the problem as a rectified flow with enhanced consistency.