Promptfoo: LLM Evals & Red Teaming#

Summary#

Open-source CLI and library for evaluating and red-teaming LLM apps. Now part of OpenAI. MIT licensed. The closest thing to a turnkey skill eval tool — YAML-based test cases, CI/CD integration, model comparison, and red teaming. Powers LLM apps serving 10M+ users in production.

Key Takeaways#

  • Developer-first: Fast, with live reload and caching. Runs 100% locally — prompts never leave your machine.
  • YAML-based test cases: Define inputs, expected outputs, and grading criteria in YAML. Directly maps to the eval.yaml format proposed in how-to-eval-a-skill.
  • Multiple grading methods: Exact match, contains, regex, LLM-as-judge, custom functions. Covers both deterministic (Tier 1) and LLM-graded (Tier 2) evaluation.
  • CI/CD integration: Run evals in GitHub Actions, block merges on failures. This is the “eval on every commit” pattern from evaluating-agent-skills-caparas.
  • Red teaming: Vulnerability scanning for prompt injection, jailbreaks, and other attacks. Relevant to ten-pillars-agentic-skill-design Pillar 6 (security).
  • Model comparison: Side-by-side comparison across providers (OpenAI, Anthropic, Azure, Bedrock, Ollama). Useful for the per-pattern model mapping that fabric supports.
  • Now part of OpenAI: Acquired but remains open source and MIT licensed.
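To make the takeaways concrete, here is a minimal sketch of a Promptfoo config combining deterministic (Tier 1) and LLM-graded (Tier 2) assertions with side-by-side provider comparison. The prompt text, variable values, and rubric are placeholders; exact provider IDs and assertion type names should be checked against the Promptfoo docs.

```yaml
# promptfooconfig.yaml — hedged sketch, not verbatim from the docs
prompts:
  - "Summarize the following text in one sentence: {{text}}"

# Model comparison: each test runs against every provider listed
providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-20241022

tests:
  - vars:
      text: "Promptfoo is an open-source CLI for evaluating LLM apps."
    assert:
      # Tier 1: deterministic checks
      - type: contains
        value: "Promptfoo"
      - type: regex
        value: "^[A-Z].*\\.$"
      # Tier 2: LLM-as-judge grading
      - type: llm-rubric
        value: "The summary is accurate, one sentence, and mentions evaluation"
```

Running `promptfoo eval` against this config produces a pass/fail matrix per test per provider, which is the raw material for the model-comparison use case above.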

Connections#

  • skill-evaluation: Promptfoo is the most practical tool for implementing the three-tier eval framework. Its YAML test cases, CI/CD integration, and LLM-as-judge support together cover Tiers 1 and 2.
  • how-to-eval-a-skill: The eval.yaml format we proposed could be implemented directly in Promptfoo’s YAML config format.
  • agent-skills-standard: An evals/ directory in the skill structure could contain Promptfoo config files, making skills self-evaluating.
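The "eval on every commit" pattern from the connections above can be sketched as a GitHub Actions workflow that runs Promptfoo against a config stored in a skill's evals/ directory. The path and job layout are assumptions for illustration; Promptfoo's CLI exits nonzero on failing assertions, which is what blocks the merge.

```yaml
# .github/workflows/evals.yml — hedged sketch; path to the config is hypothetical
name: skill-evals
on: pull_request
jobs:
  promptfoo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # A failing assertion fails this step, blocking the merge
      - run: npx promptfoo@latest eval -c skills/my-skill/evals/promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```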

See Also#