AgentBench: Evaluating LLMs as Agents#

AgentBench is a multi-dimensional benchmark by Liu et al. (Tsinghua, ICLR 2024) that evaluates LLMs as autonomous agents across eight interactive environments. It is the broadest agent benchmark in this wiki: it tests decision-making over a sequence of actions, not just text generation.

Eight Environments#

Operating System (shell commands)
Database (SQL)
Knowledge Graph (structured traversal)
Digital Card Game (strategic decisions)
Lateral Thinking Puzzles (creative reasoning)
House-Holding (embodied task planning)
Web Shopping (e-commerce navigation)
Web Browsing (information extraction)

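Each problem takes 5-50 turns to solve. Evaluation is multi-turn and interactive, not single-shot: the model acts, observes the environment's response, and acts again.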
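The multi-turn format can be sketched as a simple act/observe loop. This is a minimal illustration, not AgentBench's actual harness: `CountdownEnv`, `run_episode`, and `scripted_agent` are hypothetical names, the toy environment stands in for a real task such as a shell or a database, and the 50-turn cap is an assumption taken from the upper end of the range above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CountdownEnv:
    """Toy interactive environment (hypothetical; not an AgentBench task)."""
    target: int = 3
    state: int = 0

    def reset(self) -> str:
        # Start a fresh episode and return the task instruction.
        self.state = 0
        return f"Reach {self.target} by sending 'inc'."

    def step(self, action: str) -> Tuple[str, bool]:
        # Apply one action and return (observation, done).
        if action == "inc":
            self.state += 1
        done = self.state >= self.target
        return f"state={self.state}", done


def run_episode(env, agent: Callable[[List[str]], str],
                max_turns: int = 50) -> Tuple[bool, int]:
    """Run one multi-turn episode; return (solved, turns_used)."""
    history = [env.reset()]
    for turn in range(1, max_turns + 1):
        action = agent(history)               # the model decides from the full history
        observation, done = env.step(action)  # the environment responds
        history += [action, observation]
        if done:
            return True, turn
    return False, max_turns                   # turn budget exhausted


def scripted_agent(history: List[str]) -> str:
    # Stand-in for an LLM call; always issues the same action.
    return "inc"


solved, turns = run_episode(CountdownEnv(target=3), scripted_agent)
print(solved, turns)
```

In a real setup, `scripted_agent` would be replaced by a call to the model under test, and the history would be rendered into its prompt. The point of the sketch is that a failure on any turn carries forward, which single-shot benchmarks do not capture.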

Key Findings#

Top commercial LLMs (GPT-4 class) perform strongly as agents
Significant gap between commercial and open-source models (even at 70B parameters)
Performance varies across environments: strength in one does not imply strength in all
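Because per-environment results differ, a single ranking depends on how the scores are combined. The sketch below shows one way to compare models across environments: find the best model per environment, then average scores after dividing each by that environment's mean so that easy environments do not dominate. The numbers are invented placeholders, not AgentBench results, and the normalization scheme and the names `best_per_environment` and `normalized_overall` are assumptions made for illustration.

```python
# Placeholder scores, invented for illustration only; NOT AgentBench results.
scores = {
    "model_a": {"os": 40.0, "db": 30.0, "web_shopping": 60.0},
    "model_b": {"os": 20.0, "db": 35.0, "web_shopping": 50.0},
}


def best_per_environment(scores):
    """Return the top-scoring model for each environment."""
    envs = next(iter(scores.values())).keys()
    return {env: max(scores, key=lambda m: scores[m][env]) for env in envs}


def normalized_overall(scores):
    """Average each model's scores after dividing by the per-environment mean."""
    envs = list(next(iter(scores.values())).keys())
    env_mean = {
        e: sum(s[e] for s in scores.values()) / len(scores) for e in envs
    }
    return {
        m: sum(s[e] / env_mean[e] for e in envs) / len(envs)
        for m, s in scores.items()
    }


best = best_per_environment(scores)
overall = normalized_overall(scores)
print(best)
print(overall)
```

With these placeholder values, `model_a` leads overall but `model_b` wins the `db` environment, which is the pattern the last finding describes: an aggregate score can hide environment-level weaknesses.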

General AgentBench (2025)#

Models show substantial performance degradation when moving from domain-specific to general-agent settings. Models that look good on narrow benchmarks struggle with breadth.

Benchmark Landscape Comparison#

Benchmark            Focus                  Format
humaneval-benchmark  Code generation        Single-turn, 164 problems
swe-bench            Software engineering   Patch generation
gaia-benchmark       General AI assistant   Multi-step Q&A
agentbench           Agent decision-making  Multi-turn, 8 environments
