Computer Vision
Papers on image recognition, object detection, video analysis, and visual understanding.
cs.CV · 703 papers

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
BenchCAD is a new industry-standard benchmark for evaluating MLLMs on generating executable parametric CAD programs, revealing current models' limitations.
Masked Generative Transformer Is What You Need for Image Editing
EditMGT, a novel Masked Generative Transformer, offers faster, more precise image editing by localizing changes, outperforming diffusion models.
Is Your Driving World Model an All-Around Player?
WorldLens is a new benchmark, dataset, and agent for evaluating driving world models beyond visual realism, focusing on physical and behavioral fidelity.
Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA
Self-verification in medical VQA is unreliable, often creating a "verification mirage" where models falsely confirm incorrect answers, especially in complex tasks.
BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation
BabelDOC is an IR-based framework that accurately translates PDFs while preserving their original visual layout and improving terminology consistency.
Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training
Transcoda is a zero-shot OMR system using advanced synthetic data, normalized encodings, and grammar-based decoding to achieve state-of-the-art performance.
MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection
Introduces MMVIAD, the first continuous multi-view video dataset for industrial anomaly detection, and VISTA, a model outperforming GPT-5.4.
Predicting 3D structure by latent posterior sampling
This paper introduces a method for 3D structure prediction by combining NeRFs with diffusion models for probabilistic latent posterior sampling.
ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models
ALAM learns algebraically consistent latent transitions from action-free videos, significantly boosting VLA policy performance on complex robot manipulation tasks.
C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving
C-CoT uses VLMs and counterfactual chain-of-thought to improve safe autonomous driving decisions, especially in complex, high-risk scenarios.
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight improves end-to-end autonomous driving with a world model predicting long-horizon latent states and adaptive text reasoning.
Simultaneous Long-tailed Recognition and Multi-modal Fusion for Highly Imbalanced Multi-modal Data
This paper introduces a multi-modal fusion framework for long-tailed recognition in class-imbalanced data, outperforming single-modal methods.
Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
Urban-ImageNet is a new 2M+ multi-modal dataset and benchmark for evaluating AI's perception of urban spaces using social media imagery.
LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering
LiteMedCoT-VL enables compact 2B models to achieve advanced medical VQA reasoning by distilling chain-of-thought from a 235B teacher.
MicroDiffuse3D: A Foundation Model for 3D Microscopy Imaging Restoration
MicroDiffuse3D is a foundation model that restores high-quality 3D microscopy images from degraded, low-resolution data, enabling faster volumetric chemical imaging.
123D: Unifying Multi-Modal Autonomous Driving Data at Scale
123D is an open-source framework that unifies diverse multi-modal autonomous driving datasets through a single API, enabling scalable data access.
Normalizing Trajectory Models
Normalizing Trajectory Models (NTM) use conditional normalizing flows for few-step diffusion, achieving high-quality samples with exact likelihood.
EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction
EmambaIR introduces an efficient visual State Space Model for event-guided image reconstruction, outperforming SOTA with reduced costs.
Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
Proxy3D introduces efficient 3D representations for Vision-Language Models by using semantic-aware clustering of scene features from video frames.
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD introduces an on-policy distillation framework for Flow Matching text-to-image models, resolving multi-task alignment issues.