Computer Vision
Papers on image recognition, object detection, video analysis, and visual understanding.
cs.CV · 703 papers

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
BenchCAD is a new industry-standard benchmark for evaluating MLLMs on generating executable parametric CAD programs, revealing current models' limitations.
Masked Generative Transformer Is What You Need for Image Editing
EditMGT, a novel Masked Generative Transformer, offers faster, more precise image editing by localizing changes, outperforming diffusion models.
Is Your Driving World Model an All-Around Player?
WorldLens is a new benchmark, dataset, and agent for evaluating driving world models beyond visual realism, focusing on physical and behavioral fidelity.
Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA
Self-verification in medical VQA is unreliable, often creating a "verification mirage" where models falsely confirm incorrect answers, especially in complex tasks.
BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation
BabelDOC is an IR-based framework that accurately translates PDFs while preserving their original visual layout and improving terminology consistency.
Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training
Transcoda is a zero-shot OMR system using advanced synthetic data, normalized encodings, and grammar-based decoding to achieve state-of-the-art performance.
MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection
Introduces MMVIAD, the first continuous multi-view video dataset for industrial anomaly detection, and VISTA, a model outperforming GPT-5.4.
Predicting 3D structure by latent posterior sampling
This paper introduces a method for 3D structure prediction by combining NeRFs with diffusion models for probabilistic latent posterior sampling.
ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models
ALAM learns algebraically consistent latent transitions from action-free videos, significantly boosting VLA policy performance on complex robot manipulation tasks.
C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving
C-CoT uses VLMs and counterfactual chain-of-thought to improve safe autonomous driving decisions, especially in complex, high-risk scenarios.
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight improves end-to-end autonomous driving with a world model predicting long-horizon latent states and adaptive text reasoning.
Simultaneous Long-tailed Recognition and Multi-modal Fusion for Highly Imbalanced Multi-modal Data
This paper introduces a multi-modal fusion framework for long-tailed recognition in class-imbalanced data, outperforming single-modal methods.
Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
Urban-ImageNet is a new 2M+ multi-modal dataset and benchmark for evaluating AI's perception of urban spaces using social media imagery.
LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering
LiteMedCoT-VL enables compact 2B models to achieve advanced medical VQA reasoning by distilling chain-of-thought from a 235B teacher.
MicroDiffuse3D: A Foundation Model for 3D Microscopy Imaging Restoration
MicroDiffuse3D is a foundation model that restores high-quality 3D microscopy images from degraded, low-resolution data, enabling faster volumetric chemical imaging.
123D: Unifying Multi-Modal Autonomous Driving Data at Scale
123D is an open-source framework that unifies diverse multi-modal autonomous driving datasets through a single API, enabling scalable data access.
Normalizing Trajectory Models
Normalizing Trajectory Models (NTM) use conditional normalizing flows for few-step diffusion, achieving high-quality samples with exact likelihood.
EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction
EmambaIR introduces an efficient visual State Space Model for event-guided image reconstruction, outperforming SOTA with reduced costs.
Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
Proxy3D introduces efficient 3D representations for Vision-Language Models by using semantic-aware clustering of scene features from video frames.
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD introduces an on-policy distillation framework for Flow Matching text-to-image models, resolving multi-task alignment issues.