# Promptfoo: LLM Evals & Red Teaming

## Summary
Open-source CLI and library for evaluating and red-teaming LLM apps. Now part of OpenAI. MIT licensed. The closest thing to a turnkey skill eval tool — YAML-based test cases, CI/CD integration, model comparison, and red teaming. Powers LLM apps serving 10M+ users in production.
## Key Takeaways
- Developer-first: Fast, with live reload and caching. Runs 100% locally — prompts never leave your machine.
- YAML-based test cases: Define inputs, expected outputs, and grading criteria in YAML. Directly maps to the `eval.yaml` format proposed in how-to-eval-a-skill.
- Multiple grading methods: Exact match, contains, regex, LLM-as-judge, custom functions. Covers both deterministic (Tier 1) and LLM-graded (Tier 2) evaluation.
- CI/CD integration: Run evals in GitHub Actions, block merges on failures. This is the “eval on every commit” pattern from evaluating-agent-skills-caparas.
- Red teaming: Vulnerability scanning for prompt injection, jailbreaks, and other attacks. Relevant to ten-pillars-agentic-skill-design Pillar 6 (security).
- Model comparison: Side-by-side comparison across providers (OpenAI, Anthropic, Azure, Bedrock, Ollama). Useful for the per-pattern model mapping that fabric supports.
- Now part of OpenAI: Acquired but remains open source and MIT licensed.
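To make the YAML test-case and grading points concrete, here is a minimal sketch of a Promptfoo config. The prompt text, variable values, and model IDs are illustrative assumptions, not taken from any particular skill; the assertion types (`contains`, `regex`, `llm-rubric`) are standard Promptfoo assertions.

```yaml
# promptfooconfig.yaml -- illustrative sketch, not a real skill's eval
prompts:
  - "Summarize the following text in one sentence: {{text}}"

# Side-by-side model comparison across providers (example model IDs)
providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-haiku-20241022

tests:
  - vars:
      text: "Promptfoo is an open-source tool for evaluating LLM apps."
    assert:
      # Tier 1: deterministic checks
      - type: contains
        value: "Promptfoo"
      # Tier 2: LLM-as-judge grading
      - type: llm-rubric
        value: "The summary is a single accurate sentence."
```

Running `npx promptfoo@latest eval` executes every prompt × provider × test combination, and `promptfoo view` opens a local web UI to inspect the results.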
## Connections
- skill-evaluation: Promptfoo is the most practical tool for implementing the three-tier eval framework. Its YAML test cases + CI/CD integration + LLM-as-judge support covers Tiers 1 and 2.
- how-to-eval-a-skill: The `eval.yaml` format we proposed could be implemented directly in Promptfoo's YAML config format.
- agent-skills-standard: An `evals/` directory in the skill structure could contain Promptfoo config files, making skills self-evaluating.
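The "eval on every commit" pattern combined with an `evals/` directory could look like the following GitHub Actions sketch. The workflow name, config path, and secret name are assumptions about a hypothetical skill repo layout; the underlying mechanism is that `promptfoo eval` exits non-zero when assertions fail, which fails the job and can block the merge.

```yaml
# .github/workflows/skill-evals.yml -- hypothetical repo layout
name: skill-evals
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes the skill keeps its Promptfoo config under evals/
      - run: npx promptfoo@latest eval -c evals/promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```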