Agent Benchmarks

Standardized benchmarks for measuring LLM and agent capabilities. The wiki previously identified evaluation as “the weakest link” (key-insights-agentic-landscape). These four benchmarks represent the current state of the art across code generation, software engineering, general reasoning, and interactive agent tasks.

The Benchmark Landscape

Benchmark           | Focus                 | Scale                | Format                 | Top Score
humaneval-benchmark | Code generation       | 164 Python problems  | Single-turn, pass@k    | 96.3% (O1)
swe-bench           | Software engineering  | 2,294 GitHub issues  | Patch generation       | 74.4% (Claude 4.5 Opus)
gaia-benchmark      | General AI assistant  | 466 questions        | Multi-step Q&A + tools | <50% AI vs 92% human
agentbench          | Agent decision-making | 8 environments       | Multi-turn interactive | Commercial >> open-source
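
HumanEval's pass@k format can be made concrete with a short sketch. The version below assumes the unbiased estimator commonly used with HumanEval: draw n samples per problem, count the c that pass the unit tests, and compute 1 - C(n-c, k) / C(n, k). The function names and the (n, c) input shape are illustrative choices, not something this wiki prescribes.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each item is (n_samples, n_passed)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must not be empty")
    return sum(scores) / len(scores)
```

With k = 1 the estimator reduces to the plain fraction of passing samples, c / n, which is how a single-turn score such as the one in the table is usually read.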

Evolution of Difficulty

The benchmarks form a progression from narrow to broad:

  1. HumanEval (2021): Can the model write a function? → Largely solved (96%+)
  2. SWE-bench (2023): Can the model fix a real bug in a real codebase? → Rapidly improving (74%)
  3. GAIA (2023): Can the model reason across domains with tools? → Still far from human (<50% AI vs 92% human)
  4. AgentBench (2023): Can the model act autonomously across diverse environments? → Commercial models lead, open-source lags

Key Patterns Across Benchmarks

  • Contamination risk: all face it. HumanEval most vulnerable (164 fixed problems since 2021).
  • Narrow vs broad: models that excel on narrow benchmarks (HumanEval) may struggle with breadth (AgentBench General).
  • The last 25%: per SWE-bench’s self-driving car analogy, the remaining unsolved problems may be disproportionately hard.
  • Human baselines: GAIA uniquely provides one (92%). Most benchmarks lack clear human comparison.

What’s Still Missing

Per skill-evaluation and how-to-eval-a-skill:

  • No standard benchmark for skill-level evaluation (individual agent capabilities, not whole-model)
  • No benchmark for multi-agent coordination quality
  • No benchmark for memory/persistence quality (though LOCOMO from mem0-memory-management is closest)
  • No integrated cost-per-task metric alongside accuracy
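
As an illustration of the last gap, a cost-aware summary could sit next to accuracy along the following lines. This is only a sketch under stated assumptions: the `TaskRun` record, its `cost_usd` field, and the `summarize` helper are hypothetical names, and none of the benchmarks above report results in this shape.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskRun:
    """One benchmark task attempt (hypothetical record layout)."""
    task_id: str
    solved: bool
    cost_usd: float  # assumed: total spend for this attempt, in dollars


def summarize(runs: List[TaskRun]) -> Dict[str, float]:
    """Report accuracy together with cost per task and per solved task."""
    if not runs:
        raise ValueError("runs must not be empty")
    total = len(runs)
    solved = sum(1 for r in runs if r.solved)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "accuracy": solved / total,
        "cost_per_task": total_cost / total,
        # Spend on failed attempts is charged to the successes;
        # infinite when nothing was solved.
        "cost_per_solved_task": total_cost / solved if solved else float("inf"),
    }
```

Reporting cost per solved task alongside accuracy makes it visible when two systems reach similar accuracy at very different spend.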

Connection to Wiki Eval Framework

The skill-evaluation three-tier framework (deterministic → LLM-judge → human) and the how-to-eval-a-skill practical guide complement these benchmarks. Benchmarks measure model-level capability; the wiki’s eval framework measures skill-level quality in production.
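
One way to read the three-tier ordering is as an escalation cascade, sketched below. The escalation rule is an assumption, since this page only names the tiers: each tier returns pass, fail, or undecided, and only undecided outputs move to the next, more expensive tier. The `evaluate` function, the `Verdict` type, and both checker callables are hypothetical; a real LLM judge would call a model rather than a local function.

```python
from typing import Callable, Optional, Tuple

# True = pass, False = fail, None = undecided (escalate to the next tier).
Verdict = Optional[bool]
Checker = Callable[[str], Verdict]


def evaluate(output: str,
             deterministic_check: Checker,
             llm_judge: Checker) -> Tuple[str, Verdict]:
    """Return (tier that decided, verdict) for one skill output."""
    # Tier 1: cheap, reproducible checks (exact match, schema, unit tests).
    verdict = deterministic_check(output)
    if verdict is not None:
        return ("deterministic", verdict)
    # Tier 2: model-based grading for outputs the rules cannot settle.
    verdict = llm_judge(output)
    if verdict is not None:
        return ("llm-judge", verdict)
    # Tier 3: nothing automated decided, so queue the output for human review.
    return ("human", None)


def example_exact_check(output: str) -> Verdict:
    """Toy deterministic tier: empty fails, exact answer passes."""
    text = output.strip()
    if not text:
        return False
    if text == "42":
        return True
    return None


def example_stub_judge(output: str) -> Verdict:
    """Toy stand-in for an LLM judge; accepts a spelled-out answer."""
    return True if "forty-two" in output.lower() else None
```

Running the cheap tier first keeps judge and human effort for the outputs that actually need it, which is the practical appeal of ordering the tiers this way.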

See Also