VISOR: A Vision-Language Model-based Test Oracle for Testing Robots
Prasun Saurabh, Pablo Valle, Aitor Arrieta, Shaukat Ali, Paolo Arcaini
TLDR
VISOR is a VLM-based test oracle that automates robot task assessment, replacing manual evaluation and quantifying task correctness and quality.
Key contributions
- VISOR automates robot task correctness and quality assessment using Vision-Language Models (VLMs).
- It replaces time-consuming and subjective human evaluations for robot testing.
- Quantifies task quality, addressing limitations of traditional pass/fail symbolic oracles.
- Evaluated with GPT and Gemini on over 1,000 videos across four robotic tasks, revealing precision/recall trade-offs between the two VLMs.
Why it matters
This paper introduces an automated, VLM-based solution to the robot test oracle problem, significantly reducing the need for costly human evaluations. It improves upon traditional methods by quantifying task quality and assessing VLM uncertainty, advancing robot testing efficiency and reliability.
Original Abstract
Testing robots requires assessing whether they perform their intended tasks correctly, dependably, and with high quality, a challenge known as the test oracle problem in software testing. Traditionally, this assessment relies on task-specific symbolic oracles for task correctness and on human manual evaluation of robot behavior, which is time-consuming, subjective, and error-prone. To address this, we propose VISOR, a Vision-Language Model (VLM)-based approach for automated test oracle assessment that eliminates the need for expensive human evaluations. VISOR performs automated evaluation of task correctness and quality, addressing the limitations of existing symbolic test oracles, which are task-specific and provide pass/fail judgments without explicitly quantifying task quality. Given the inherent uncertainty in VLMs, VISOR also explicitly quantifies its own uncertainty during test assessments. We evaluated VISOR using two VLMs, i.e., GPT and Gemini, across four robotic tasks on over 1,000 videos. Results show that Gemini achieves higher recall while GPT achieves higher precision. However, both models show low correlation between uncertainty and correctness, which prevents using uncertainty as a correctness predictor.
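The abstract's evaluation hinges on two measurements: precision/recall of the oracle's pass/fail verdicts against ground truth, and the correlation between the VLM's self-reported uncertainty and whether its verdict was correct. The minimal sketch below illustrates how such metrics could be computed; the verdicts, ground-truth labels, and uncertainty values are invented for illustration and are not data from the paper.

```python
# Hypothetical sketch of the two metrics described in the abstract:
# (1) precision/recall of oracle verdicts, (2) correlation between
# the model's uncertainty and verdict correctness. Toy data only.
import math

def precision_recall(predicted, actual):
    """Treat 'fail' as the positive class (a detected task failure)."""
    tp = sum(p == "fail" and a == "fail" for p, a in zip(predicted, actual))
    fp = sum(p == "fail" and a == "pass" for p, a in zip(predicted, actual))
    fn = sum(p == "pass" and a == "fail" for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def pearson(xs, ys):
    """Pearson correlation, here between uncertainty and correctness."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Made-up oracle verdicts vs. ground truth, with per-verdict uncertainty.
predicted   = ["fail", "fail", "pass", "fail", "pass"]
actual      = ["fail", "pass", "pass", "fail", "fail"]
uncertainty = [0.2, 0.7, 0.1, 0.3, 0.6]
correct     = [float(p == a) for p, a in zip(predicted, actual)]

p, r = precision_recall(predicted, actual)
corr = pearson(uncertainty, correct)
```

A strongly negative `corr` would mean high uncertainty predicts wrong verdicts; the paper reports that in practice this correlation was too low for uncertainty to serve as a correctness predictor.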