Computer Vision
Papers on image recognition, object detection, video analysis, and visual understanding.
cs.CV · 703 papers
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
GAP proposes a granular alignment paradigm to stabilize visual latent reasoning in MLLMs by addressing feature-space mismatches, improving performance.
VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference
VIP enhances dino.txt for efficient, high-quality open-vocabulary semantic segmentation by evolving text prompts with visual guidance.
From Model Uncertainty to Human Attention: Localization-Aware Visual Cues for Scalable Annotation Review
This paper introduces visual cues for spatial uncertainty in AI-assisted annotation, improving label quality and speed by guiding human attention.
EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras
EgoEV-HandPose uses stereo event cameras and a new dataset for robust egocentric 3D hand pose estimation and gesture recognition, outperforming RGB-based approaches.
TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion
TriBand-BEV introduces a real-time LiDAR-only 3D pedestrian detection method using a height-aware BEV encoding, outperforming prior methods on KITTI.
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA transforms imagined robot manipulation videos into executable actions by inferring a mixture of latent actions via inverse dynamics models.
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
This paper introduces Uni-AdGen, a unified autoregressive model for personalized image and text ad generation that improves realism and alignment with user preferences.
World Action Models: The Next Frontier in Embodied AI
This survey introduces World Action Models (WAMs), a new embodied AI paradigm unifying predictive state modeling with action generation, providing a systematic overview.
Very Efficient Listwise Multimodal Reranking for Long Documents
ZipRerank is a highly efficient listwise multimodal reranker that significantly speeds up M-RAG for long documents by reducing input length and eliminating autoregressive decoding.
One-Step Generative Modeling via Wasserstein Gradient Flows
W-Flow introduces a novel one-step generative model using Wasserstein gradient flows, achieving state-of-the-art image generation 100x faster than diffusion models.
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
SLAS improves text-to-image models by using a novel super-linear advantage shaping to mitigate reward hacking and enhance training efficiency and robustness.
Personal Visual Context Learning in Large Multimodal Models
This paper defines Personal VCL for LMMs, presents a benchmark, and proposes the Agentic Context Bank to enable personalized visual reasoning.
Variational Inference for Lévy Process-Driven SDEs via Neural Tilting
This paper introduces a neural exponential tilting framework for variational inference in Lévy-driven SDEs, addressing challenges in modeling extreme events.
Pixal3D: Pixel-Aligned 3D Generation from Images
Pixal3D introduces a pixel-aligned 3D generation method that significantly improves fidelity for creating high-quality 3D assets from images.
Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition
A new confidence-guided diffusion augmentation framework significantly boosts Bangla compound character recognition by synthesizing and filtering high-quality data.
CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models
CapVector learns transferable capability vectors in parametric space for VLA models, enhancing performance and reducing adaptation costs during finetuning.
Count Anything at Any Granularity
This paper introduces multi-grained open-world object counting, a new dataset (KubriCount), and a model (HieraCount) to improve counting accuracy.
Geometry-aware Prototype Learning for Cross-domain Few-shot Medical Image Segmentation
GeoProto is a geometry-aware framework for cross-domain few-shot medical image segmentation, improving generalization by leveraging structural priors.
CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation
CADBench is a new unified multimodal benchmark for evaluating AI models in generating editable CAD programs from various inputs.
BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data
BEACON is a large, multimodal dataset from competitive Valorant gameplay for continuous authentication and behavioral fingerprinting research.