AgentBench: Evaluating LLMs as Agents
Multi-dimensional benchmark by Liu et al. (Tsinghua, ICLR 2024) that evaluates LLMs as autonomous agents across eight interactive environments. The broadest agent benchmark in the wiki: it tests decision-making, not just generation.
Eight Environments
- Operating System (shell commands)
- Database (SQL)
- Knowledge Graph (structured traversal)
- Digital Card Game (strategic decisions)
- Lateral Thinking Puzzles (creative reasoning)
- House-Holding (embodied task planning)
- Web Shopping (e-commerce navigation)
- Web Browsing (information extraction)
Each problem takes roughly 5-50 turns. Tasks are multi-turn and interactive, not single-shot.
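The multi-turn format can be sketched as a loop in which the agent receives an observation, chooses an action, and repeats until the environment ends the episode or a turn cap is hit. This is a minimal illustration, not AgentBench's actual harness: `CountdownEnv`, `wait_for_go`, `run_episode`, and the `max_turns` parameter are all hypothetical names invented for this example.

```python
from typing import Callable, List, Tuple


class CountdownEnv:
    """Toy environment (hypothetical): the agent must answer 'done' only after seeing 'go'."""

    def __init__(self, wait_turns: int = 2) -> None:
        self.wait_turns = wait_turns
        self.turn = 0

    def reset(self) -> str:
        self.turn = 0
        return "wait"

    def step(self, action: str) -> Tuple[str, bool, bool]:
        """Apply one action and return (observation, finished, success)."""
        self.turn += 1
        if action == "done":
            # The episode ends; it counts as a success only if the agent waited long enough.
            return "end", True, self.turn > self.wait_turns
        obs = "go" if self.turn >= self.wait_turns else "wait"
        return obs, False, False


def wait_for_go(observation: str, history: List[Tuple[str, str]]) -> str:
    """Stand-in for an LLM call: a fixed rule mapping the observation to an action."""
    return "done" if observation == "go" else "noop"


def run_episode(
    env: CountdownEnv,
    policy: Callable[[str, List[Tuple[str, str]]], str],
    max_turns: int = 50,
) -> Tuple[bool, int]:
    """Run one multi-turn episode and return (success, turns_used)."""
    obs = env.reset()
    history: List[Tuple[str, str]] = []
    for turn in range(1, max_turns + 1):
        action = policy(obs, history)
        history.append((obs, action))
        obs, finished, success = env.step(action)
        if finished:
            return success, turn
    # Turn budget exhausted without the environment ending the episode.
    return False, max_turns


if __name__ == "__main__":
    print(run_episode(CountdownEnv(wait_turns=2), wait_for_go))
    print(run_episode(CountdownEnv(wait_turns=2), wait_for_go, max_turns=2))
```

A single-shot benchmark would call the model once and score the output. Here the score depends on the whole trajectory, so an agent that runs out of turns fails even if each individual action was reasonable.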
Key Findings
- Top commercial LLMs (GPT-4 class) perform strongly as agents
- Significant gap between commercial and open-source models (even at 70B parameters)
- Performance varies across environments: strength in one ≠ strength in all
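Because scores vary by environment and each environment uses its own metric, a single overall number needs some form of normalisation. The sketch below divides each environment's score by the mean score of all models on that environment, then averages the results. That scheme is an assumption chosen for illustration and may not match the paper's exact scoring. The model names and all numbers are made up, and `overall_scores` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict

# Hypothetical per-environment scores (made-up numbers, NOT results from the paper).
SCORES: Dict[str, Dict[str, float]] = {
    "model_a": {"os": 40.0, "db": 30.0, "card_game": 70.0},
    "model_b": {"os": 20.0, "db": 30.0, "card_game": 10.0},
}


def overall_scores(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average of per-environment scores, each divided by that environment's mean over models."""
    envs = sorted(next(iter(scores.values())))
    env_mean = {e: mean(per_env[e] for per_env in scores.values()) for e in envs}
    return {
        name: mean(per_env[e] / env_mean[e] for e in envs)
        for name, per_env in scores.items()
    }


if __name__ == "__main__":
    for name, value in overall_scores(SCORES).items():
        print(f"{name}: {value:.3f}")
```

With this normalisation an environment where every model scores high does not dominate the average. The per-environment breakdown is still worth reporting, since two models can tie on one environment (`db` here) and differ sharply on another.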
General AgentBench (2025)
Substantial performance degradation moving from domain-specific to general-agent settings. Models that look good on narrow benchmarks struggle with breadth.
Benchmark Landscape Comparison
| Benchmark | Focus | Format |
|---|---|---|
| humaneval-benchmark | Code generation | Single-turn, 164 problems |
| swe-bench | Software engineering | Patch generation |
| gaia-benchmark | General AI assistant | Multi-step Q&A |
| agentbench | Agent decision-making | Multi-turn, 8 environments |