ArXiv TLDR

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

arXiv:2605.10873

Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme, Era Syla, Amin Heyrani Nobari + 1 more

cs.CV, cs.AI

TLDR

CADBench is a new unified multimodal benchmark for evaluating AI models in generating editable CAD programs from various inputs.

Key contributions

  • Introduces CADBench, a unified benchmark with 18,000 samples across 6 families and 5 input modalities.
  • Uses 6 metrics covering geometric fidelity, executability, and program compactness to score generated CAD programs (a toy sketch of two such checks follows this list).
  • Benchmarks 11 CAD-specialized and general-purpose vision-language systems; under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain unreliable for CAD program reconstruction.
  • Reveals recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models are brittle under modality shift, and model rankings change across metrics.
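
The paper's exact metric definitions are not reproduced in this summary. As a rough illustration of what a geometric-fidelity check and an executability check might involve, here is a minimal Python sketch using a symmetric Chamfer distance between sampled point clouds and a naive "does the program run" test. The function names and the exec-based check are hypothetical stand-ins, not CADBench's implementation; the benchmark presumably executes programs in a proper CAD kernel.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(points_a: np.ndarray, points_b: np.ndarray) -> float:
    """Symmetric Chamfer distance between two (N, 3) point clouds."""
    dist_a_to_b, _ = cKDTree(points_b).query(points_a)  # nearest-neighbor distances A -> B
    dist_b_to_a, _ = cKDTree(points_a).query(points_b)  # nearest-neighbor distances B -> A
    return float(dist_a_to_b.mean() + dist_b_to_a.mean())

def is_executable(cad_program: str) -> bool:
    """Crude stand-in for an executability metric: does the generated program
    run without raising? (Hypothetical; not how CADBench evaluates programs.)"""
    try:
        exec(compile(cad_program, "<generated>", "exec"), {})
        return True
    except Exception:
        return False

# Toy usage: compare a noisy "reconstruction" against ground-truth samples.
rng = np.random.default_rng(0)
gt = rng.uniform(size=(2048, 3))                    # ground-truth surface samples
pred = gt + rng.normal(scale=0.01, size=gt.shape)   # perturbed reconstruction
print(f"Chamfer distance: {chamfer_distance(pred, gt):.4f}")
print(f"Executable: {is_executable('x = 1 + 1')}")
```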

Why it matters

This paper addresses the fragmented evaluation landscape in AI-assisted CAD program generation: CADBench provides a single, comprehensive multimodal benchmark for measuring progress and diagnosing model limitations, and its findings can guide future research in editable 3D reconstruction and CAD understanding.

Original Abstract

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://huggingface.co/datasets/DeCoDELab/CADBench.
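
The abstract gives the benchmark's Hugging Face URL. Assuming the dataset can be pulled with the standard `datasets` library, a minimal loading sketch might look like the following; the split name and the habit of inspecting the first sample's fields are assumptions, so check the dataset card for the actual configurations and schema.

```python
from datasets import load_dataset

# Dataset path from the abstract; split name is an assumption.
ds = load_dataset("DeCoDELab/CADBench", split="test")

sample = ds[0]
print(sample.keys())  # inspect the real field names before relying on them
```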
