Agent Benchmarks
Standardized benchmarks for measuring LLM and agent capabilities. The wiki previously identified evaluation as “the weakest link” (key-insights-agentic-landscape). These four benchmarks represent the current state of the art across code generation, software engineering, general reasoning, and interactive agent tasks.
The Benchmark Landscape
| Benchmark | Focus | Scale | Format | Top Score |
|---|---|---|---|---|
| humaneval-benchmark | Code generation | 164 Python problems | Single-turn, pass@k | 96.3% (o1) |
| swe-bench | Software engineering | 2,294 GitHub issues | Patch generation | 74.4% (Claude 4.5 Opus) |
| gaia-benchmark | General AI assistant | 466 questions | Multi-step Q&A + tools | <50% AI vs 92% human |
| agentbench | Agent decision-making | 8 environments | Multi-turn interactive | Commercial >> open-source |
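The pass@k format listed for HumanEval is usually reported with the unbiased estimator from the HumanEval paper: draw n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k randomly chosen samples passes. A minimal sketch follows; the function names `pass_at_k` and `benchmark_pass_at_k` are illustrative and not taken from any benchmark harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random size-k
    # subset of the n samples contains no passing sample.
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the plain fraction of passing samples, c / n, averaged over problems.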
Evolution of Difficulty
The benchmarks form a progression from narrow to broad:
- HumanEval (2021): Can the model write a function? → Largely solved (96%+)
- SWE-bench (2023): Can the model fix a real bug in a real codebase? → Rapidly improving (74%)
- GAIA (2023): Can the model reason across domains with tools? → Still far from human (92% human vs <50% AI)
- AgentBench (2023): Can the model act autonomously across diverse environments? → Commercial models lead, open-source lags
Key Patterns Across Benchmarks
- Contamination risk: all four benchmarks face it. HumanEval is the most vulnerable, since its 164 problems have been unchanged since 2021.
- Narrow vs broad: models that excel on narrow benchmarks (HumanEval) may struggle with breadth (AgentBench General).
- The last 25%: SWE-bench’s self-driving car analogy suggests that the remaining unsolved problems may be disproportionately hard.
- Human baselines: GAIA uniquely provides one (92%). Most benchmarks lack clear human comparison.
What’s Still Missing
Per skill-evaluation and how-to-eval-a-skill:
- No standard benchmark for skill-level evaluation (individual agent capabilities, not whole-model)
- No benchmark for multi-agent coordination quality
- No benchmark for memory/persistence quality (though LOCOMO from mem0-memory-management is closest)
- No integrated cost-per-task metric alongside accuracy
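As an illustration of the last gap, a cost-per-task figure can be reported next to accuracy from the same run log. The sketch below is hypothetical throughout: the `TaskRun` record, the `summarize` function, and the per-1k-token prices are assumptions for illustration, not part of any benchmark named here.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One benchmark attempt (hypothetical record layout)."""
    task_id: str
    solved: bool
    input_tokens: int
    output_tokens: int


def summarize(runs: list[TaskRun],
              usd_per_1k_input: float,
              usd_per_1k_output: float) -> dict[str, float]:
    """Report accuracy together with average cost per task and per solved task."""
    if not runs:
        raise ValueError("runs must not be empty")
    total_cost = sum(
        r.input_tokens / 1000 * usd_per_1k_input
        + r.output_tokens / 1000 * usd_per_1k_output
        for r in runs
    )
    solved = sum(r.solved for r in runs)
    return {
        "accuracy": solved / len(runs),
        "cost_per_task": total_cost / len(runs),
        # Failed attempts still cost money, so they are charged to the solved tasks.
        "cost_per_solved_task": total_cost / solved if solved else float("inf"),
    }
```

Dividing total spend by solved tasks, rather than by attempts, makes a cheap but unreliable agent visibly more expensive per useful result.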
Connection to Wiki Eval Framework
The skill-evaluation three-tier framework (deterministic → LLM-judge → human) and the how-to-eval-a-skill practical guide complement these benchmarks. Benchmarks measure model-level capability; the wiki’s eval framework measures skill-level quality in production.
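One way to read the three tiers is as an escalation chain: cheap deterministic checks run first, an LLM judge handles what they cannot decide, and anything still undecided goes to a human. The page does not say that the arrows mean escalation, so the sketch below treats that as an assumption. `evaluate`, `non_empty_check`, and `stub_judge` are hypothetical names, and the judge is a stand-in rather than a real model call.

```python
from typing import Callable, Optional

# True = pass, False = fail, None = this tier cannot decide, so escalate.
Verdict = Optional[bool]
Check = Callable[[str], Verdict]


def evaluate(output: str, deterministic: Check, llm_judge: Check) -> tuple[str, Verdict]:
    """Run tiers from cheapest to most expensive; return (deciding tier, verdict)."""
    for name, check in (("deterministic", deterministic), ("llm-judge", llm_judge)):
        verdict = check(output)
        if verdict is not None:
            return name, verdict
    # Neither automated tier decided: hand off to human review.
    return "human", None


def non_empty_check(output: str) -> Verdict:
    """Hypothetical deterministic tier: empty output fails, anything else is undecided."""
    return False if not output.strip() else None


def stub_judge(output: str) -> Verdict:
    """Stand-in for an LLM judge; a real one would call a model here."""
    return True if "done" in output else None
```

Calling `evaluate("", non_empty_check, stub_judge)` stops at the deterministic tier, while an output that neither automated tier can decide is routed to the human tier with no verdict.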