ArXiv TLDR

GAIA: a benchmark for General AI Assistants

arXiv: 2311.12983

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun + 1 more

cs.CL, cs.AI

TLDR

GAIA is a new benchmark designed to evaluate AI assistants on real-world tasks requiring reasoning, multi-modality, web browsing, and tool use, highlighting a significant gap between AI and human performance.

Key contributions

  • Introduces GAIA, a benchmark with 466 real-world questions testing fundamental AI assistant abilities.
  • Demonstrates a large performance gap: humans score 92% while GPT-4 with plugins scores only 15%.
  • Focuses on tasks simple for humans but challenging for AI, emphasizing robustness over superhuman specialization.

Why it matters

This paper matters because it shifts AI evaluation from narrowly defined professional tasks to broader, human-like general abilities essential for true Artificial General Intelligence. By highlighting current AI limitations on everyday reasoning and tool use, GAIA provides a meaningful challenge and a roadmap for advancing AI systems toward human-level robustness and versatility.

Original Abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answer. We release our questions while retaining answers to 300 of them to power a leader-board available at https://huggingface.co/gaia-benchmark.
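Because GAIA questions have short, unambiguous answers, a system's 92%-vs-15% style accuracy can be computed by matching each model output against the ground truth. A minimal sketch of such a normalize-and-compare scorer (the specific normalization rules here are an illustrative assumption, not taken from the paper):

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and strip surrounding punctuation."""
    answer = re.sub(r"\s+", " ", answer.strip().lower())
    return answer.strip(" .,;:'\"")

def quasi_exact_match(prediction: str, truth: str) -> bool:
    """True if the model's short answer matches the ground truth after normalization."""
    return normalize(prediction) == normalize(truth)

def score(predictions: list[str], truths: list[str]) -> float:
    """Fraction of questions answered correctly."""
    correct = sum(quasi_exact_match(p, t) for p, t in zip(predictions, truths))
    return correct / len(truths)
```

In practice, the official leaderboard at https://huggingface.co/gaia-benchmark holds back the 300 test answers, so scoring against that split happens server-side rather than with local code like the above.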
