ArXiv TLDR

QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

2604.25884

Shuxiang Cao, Zijian Zhang, Abhishek Agarwal, Grace Bratrud, Niyaz R. Beysengulov + 27 more

quant-ph cs.CV

TLDR

QCalEval introduces the first benchmark for evaluating Vision-Language Models on quantum calibration plot understanding, revealing a persistent gap between open-weight and frontier closed models under multi-image in-context learning.

Key contributions

  • Introduces QCalEval, the first VLM benchmark for quantum calibration plots, with 243 samples across 87 scenario types from 22 experiment families.
  • Evaluates VLMs on six question types; the best general-purpose model reaches a 72.3 zero-shot mean score.
  • Finds many open-weight models degrade under multi-image in-context learning, while frontier closed models improve substantially.
  • Shows via a 9-billion-parameter ablation that supervised fine-tuning improves zero-shot performance but cannot close the multimodal in-context learning gap.
  • Releases NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that achieves a 74.7 zero-shot average score.
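The headline numbers above (72.3 and 74.7) are mean scores averaged over question types. A minimal sketch of that kind of aggregation is below; the question-type names and per-type scores are illustrative placeholders, not values from the paper.

```python
# Sketch of averaging per-question-type scores into one benchmark mean.
# Question-type names and scores are hypothetical examples.
from statistics import mean

def mean_benchmark_score(per_type_scores: dict[str, float]) -> float:
    """Average per-question-type scores into a single mean score."""
    return mean(per_type_scores.values())

zero_shot = {  # hypothetical scores for six question types
    "plot_type": 80.0,
    "axis_reading": 70.0,
    "fit_quality": 65.0,
    "parameter_extraction": 75.0,
    "anomaly_detection": 68.0,
    "next_step": 72.0,
}
print(round(mean_benchmark_score(zero_shot), 1))  # → 71.7
```

Comparing such means across zero-shot and in-context settings is how gaps like the open-weight degradation reported above would surface.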

Why it matters

This paper addresses a critical gap in evaluating VLMs for quantum computing, where interpreting calibration plots is essential. It provides a much-needed benchmark and insights into current VLM capabilities and limitations. The release of an open-weight reference model further aids research in this specialized domain.

Original Abstract

Quantum computing calibration depends on interpreting experimental data, and calibration plots provide the most universal human-readable representation for this task, yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose zero-shot model reaches a mean score of 72.3, and many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially. A supervised fine-tuning ablation at the 9-billion-parameter scale shows that SFT improves zero-shot performance but cannot close the multimodal in-context learning gap. As a reference case study, we release NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that reaches 74.7 zero-shot average score.
