ArXiv TLDR

QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

2604.25884

Shuxiang Cao, Zijian Zhang, Abhishek Agarwal, Grace Bratrud, Niyaz R. Beysengulov + 27 more

quant-ph cs.CV

TLDR

QCalEval introduces the first benchmark for evaluating Vision-Language Models on quantum calibration plot understanding, revealing a persistent gap between open-weight and frontier closed models under multi-image in-context learning.

Key contributions

  • Introduces QCalEval, the first VLM benchmark for quantum calibration plots, with 243 samples across 87 scenario types from 22 experiment families.
  • Evaluates VLMs on six question types; the best general-purpose model reaches a 72.3 zero-shot mean score.
  • Finds many open-weight models degrade under multi-image in-context learning, while frontier closed models improve substantially.
  • Shows via a 9-billion-parameter ablation that supervised fine-tuning improves zero-shot performance but cannot close the multimodal in-context learning gap.
  • Releases NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that achieves a 74.7 zero-shot average score.
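The headline numbers above (72.3 and 74.7) are mean scores averaged over question types. A minimal sketch of that kind of aggregation is below; the question-type names and per-type scores are illustrative placeholders, not values from the paper.

```python
# Sketch of averaging per-question-type scores into one benchmark mean.
# Question-type names and scores are hypothetical examples.
from statistics import mean

def mean_benchmark_score(per_type_scores: dict[str, float]) -> float:
    """Average per-question-type scores into a single mean score."""
    return mean(per_type_scores.values())

zero_shot = {  # hypothetical scores for six question types
    "plot_type": 80.0,
    "axis_reading": 70.0,
    "fit_quality": 65.0,
    "parameter_extraction": 75.0,
    "anomaly_detection": 68.0,
    "next_step": 72.0,
}
print(round(mean_benchmark_score(zero_shot), 1))  # → 71.7
```

Comparing such means across zero-shot and in-context settings is how gaps like the open-weight degradation reported above would surface.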

Why it matters

This paper addresses a critical gap in evaluating VLMs for quantum computing, where interpreting calibration plots is essential. It provides a much-needed benchmark and insights into current VLM capabilities and limitations. The release of an open-weight reference model further aids research in this specialized domain.

Original Abstract

Quantum computing calibration depends on interpreting experimental data, and calibration plots provide the most universal human-readable representation for this task, yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose zero-shot model reaches a mean score of 72.3, and many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially. A supervised fine-tuning ablation at the 9-billion-parameter scale shows that SFT improves zero-shot performance but cannot close the multimodal in-context learning gap. As a reference case study, we release NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that reaches 74.7 zero-shot average score.
