HumanEval: Evaluating Large Language Models Trained on Code#

The foundational code generation benchmark by Chen et al. (OpenAI, 2021). 164 hand-crafted Python problems. Introduced the pass@k metric that became the standard for measuring AI coding capabilities.
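
To make the evaluation concrete, here is a minimal sketch of how a single problem might be checked for functional correctness. The field names (`task_id`, `prompt`, `entry_point`, `test`) follow the released dataset, but the toy problem and the `passes` helper are hypothetical illustrations, not the official harness, which additionally sandboxes execution and enforces timeouts.

```python
# Hypothetical toy problem laid out with HumanEval-style field names.
problem = {
    "task_id": "Toy/0",
    "prompt": 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n',
    "entry_point": "add",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
}


def passes(problem: dict, completion: str) -> bool:
    """Return True if prompt + completion passes the problem's unit tests.

    WARNING: exec() runs arbitrary code. This sketch has no sandbox and no
    timeout, so it should only be run on trusted completions.
    """
    program = (
        problem["prompt"]
        + completion
        + "\n\n"
        + problem["test"]
        + "\n"
        + f"check({problem['entry_point']})\n"
    )
    namespace: dict = {}
    try:
        exec(program, namespace)
    except Exception:
        # Any failed assertion, runtime error, or syntax error counts as a fail.
        return False
    return True


good = "    return a + b\n"  # correct function body
bad = "    return a - b\n"   # wrong function body

print(passes(problem, good), passes(problem, bad))
```

A completion counts as correct only if every assertion in `check` holds, which is what makes the evaluation binary.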

The pass@k Metric#

  • pass@1: probability that a single generated solution is correct, i.e. passes all unit tests
  • pass@10/100: probability that at least one of k = 10 or k = 100 attempts is correct

Allowing k attempts acknowledges how programmers actually work: they iterate, debug, and refine.
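
In practice pass@k is estimated per problem from n ≥ k samples, of which c pass the tests, using the unbiased estimator from the paper, 1 − C(n−c, k) / C(n, k), and then averaged over all problems. The sketch below is one way to compute it; the function name and argument checks are illustrative choices, not the reference implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem pass@k estimate.

    n: number of samples generated for the problem
    c: number of those samples that passed all tests
    k: attempt budget being evaluated (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 - P(all k samples drawn without replacement are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, with n = 10 samples and c = 3 passing, `pass_at_k(10, 3, 1)` is 0.3, which matches the plain fraction of correct samples as expected for k = 1.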

Performance Evolution (0% → 96.3% in 3 years)#

Year     Model            pass@1
2021     GPT-3            0%
2021     Codex            28.8%
2023     Top models       70-80%
2024-25  O1 Preview/Mini  96.3%

EvalPlus (Enhanced)#

EvalPlus adds test cases that probe edge cases. O1 drops from 96.3% to 89%. The consistent 7-8 point gap across models suggests a systematic limitation: pattern matching on the common case works, robust reasoning about edge cases doesn’t.
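
A hypothetical illustration of how extra tests can expose a solution that the base tests accept. The function, the test lists, and the `passes_all` helper are all invented for this sketch and are not taken from HumanEval or EvalPlus.

```python
def max_element(xs):
    """Hypothetical flawed candidate: correct only when some element is >= 0."""
    best = 0  # bug: should start from xs[0], not 0
    for x in xs:
        if x > best:
            best = x
    return best


# Base tests cover only the common case of positive inputs.
base_tests = [([1, 2, 3], 3), ([5, 3, 9, 0], 9)]

# Extended tests add an all-negative list, an edge case the base tests miss.
extra_tests = base_tests + [([-4, -2, -7], -2)]


def passes_all(fn, tests) -> bool:
    """Return True if fn gives the expected output on every (input, expected) pair."""
    return all(fn(list(inp)) == expected for inp, expected in tests)


print(passes_all(max_element, base_tests))   # accepted by the base tests
print(passes_all(max_element, extra_tests))  # rejected once the edge case is added
```

The same solution flips from pass to fail without any change to the code, which is the mechanism behind the score drop when tests are extended.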

Limitations#

  • Contamination: 164 problems widely available since 2021
  • Narrow scope: simple self-contained problems, not real codebases
  • Binary evaluation: pass/fail ignores readability, efficiency, security
  • Measurement ceiling: with multiple models at 95%+, the benchmark can no longer differentiate between them
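
A rough back-of-the-envelope sketch of why scores near the ceiling are hard to tell apart. It assumes, as a simplification, that the 164 problems behave like independent pass/fail trials; the helper name is illustrative.

```python
from math import sqrt

N_PROBLEMS = 164  # size of the benchmark


def pass1_standard_error(p: float, n: int = N_PROBLEMS) -> float:
    """Binomial standard error of a pass@1 score p measured on n problems.

    Simplifying assumption: each problem is an independent Bernoulli trial.
    """
    return sqrt(p * (1.0 - p) / n)


# One problem is worth about 0.61 percentage points.
points_per_problem = 100.0 / N_PROBLEMS

# At a 95% score, the standard error is roughly 1.7 percentage points.
se_points = 100.0 * pass1_standard_error(0.95)

print(round(points_per_problem, 2), round(se_points, 2))
```

Under this assumption, differences of one or two points between models scoring 95%+ amount to a handful of problems and sit within sampling noise.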

Spawned Ecosystem#

EvalPlus (the framework behind HumanEval+, which extends the test suites), HumanEval-X (multi-language), HumanEval-V (visual/multi-modal).

See Also#