GAIA: A Benchmark for General AI Assistants#

Benchmark by Mialon et al. (Meta, Hugging Face) for evaluating AI assistants on real-world tasks requiring reasoning, tool use, web browsing, and multimodal understanding. Unlike code-focused benchmarks, GAIA targets general-purpose assistant ability rather than a single specialized skill.

Dataset#

466 human-annotated questions across three difficulty levels:

  • Level 1: ≤5 steps, generally no tools
  • Level 2: roughly 5–10 steps, combining several tools
  • Level 3: arbitrarily long action sequences, any number of tools
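
Because results are usually reported per level, a small tally helper is handy. The following is a minimal sketch, not part of GAIA itself: the `Result` record and its field names (`task_id`, `level`, `correct`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Result:
    task_id: str   # hypothetical identifier for a question
    level: int     # GAIA difficulty level: 1, 2, or 3
    correct: bool  # whether the assistant's answer was scored as correct


def success_rate_by_level(results):
    """Return {level: fraction of correct answers} for the levels present."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r.level] += 1
        hits[r.level] += int(r.correct)
    return {level: hits[level] / totals[level] for level in sorted(totals)}


# Illustrative data only, not real benchmark results.
demo = [
    Result("a", 1, True),
    Result("b", 1, False),
    Result("c", 2, True),
    Result("d", 3, False),
]
print(success_rate_by_level(demo))  # {1: 0.5, 2: 1.0, 3: 0.0}
```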

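Evaluation is by quasi-exact match against a single ground-truth answer (a number, a short string, or a comma-separated list), which makes scoring cheap and unambiguous.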
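
The matching rule can be sketched as follows. This is a simplified illustration and not the official GAIA scorer: the function names and the specific normalization choices (lowercasing, whitespace collapsing, stripping `$`, `%`, and thousands separators from numbers) are assumptions.

```python
import re


def normalize_text(s: str) -> str:
    """Lowercase, collapse whitespace, and trim surrounding spaces and periods."""
    s = re.sub(r"\s+", " ", s.strip().lower())
    return s.strip(" .")


def to_number(s: str):
    """Parse s as a float after dropping $, %, and commas; return None on failure."""
    cleaned = s.strip().replace(",", "").replace("$", "").replace("%", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def quasi_exact_match(prediction: str, truth: str) -> bool:
    """Compare a prediction to the ground truth by answer type (assumed rules)."""
    truth_num = to_number(truth)
    if truth_num is not None:
        # Numeric answer: compare values, so "1,000" matches "1000".
        pred_num = to_number(prediction)
        return pred_num is not None and pred_num == truth_num
    if "," in truth:
        # List answer: compare normalized items in order.
        pred_items = [normalize_text(p) for p in prediction.split(",")]
        truth_items = [normalize_text(t) for t in truth.split(",")]
        return pred_items == truth_items
    # Plain string answer.
    return normalize_text(prediction) == normalize_text(truth)
```

Under these assumed rules, `quasi_exact_match("1,000", "1000")` is true, while `quasi_exact_match("Saint Petersburg", "St. Petersburg")` is false: a correct answer in a different phrasing is still rejected.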

Performance Gap#

  • Non-expert humans: ~92% success rate
  • AI models (at release): GPT-4 with plugins ~15% overall; <50% even on the easiest tasks

This is the inverse of most recent benchmarks, where AI approaches human scores. GAIA tasks are easy for humans but hard for AI: they test fundamental abilities, not specialized knowledge.

Key Design Principles#

  • Simple to verify but hard to solve
  • Real-world grounding (actual assistant use cases)
  • Tool-use required, not just language understanding
  • No simulated environments needed (unlike AgentBench)

Limitations#

  • Data contamination risk
  • String matching may miss valid phrasings
  • Web-dependent questions may break over time

See Also#