Skill Evaluation#
Moving from “it feels better” to “I have proof” when measuring AI agent skill quality.
The Problem#
LLM agents are non-deterministic. Manual testing captures one sample from a distribution. “Vibes-based” evaluation misses regressions, false positives, and edge cases. As Karpathy noted: “The eval is often harder than the task itself.”
Three-Tier Framework#
| Tier | Method | Cost | Frequency | Catches |
|---|---|---|---|---|
| 1 | Deterministic graders | ~$0 | Every commit | Command execution, file existence, sequence, format |
| 2 | LLM-as-judge | $0.01–0.20/eval | PRs, nightly | Code quality, conventions, readability (rubric-based) |
| 3 | Human review | $0.50–5.00/eval | Sparingly | Calibration, edge cases, high-stakes decisions |
“The best eval is one that actually gets run.” — Anthropic
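Tier 1 graders need no model calls at all. A minimal sketch of what such a grader can look like, checking file existence, output format, and non-emptiness as binary criteria (the field names and checks here are illustrative assumptions, not a fixed schema):

```python
import json
from pathlib import Path


def _is_json(text: str) -> bool:
    """True if text parses as JSON."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def grade_output(workdir: Path, expected_file: str, output: str) -> dict:
    """Tier 1 deterministic grader: every check is binary pass/fail,
    so results are reproducible and cost ~$0 per run."""
    checks = {
        "file_exists": (workdir / expected_file).exists(),
        "valid_json": _is_json(output),
        "non_empty": bool(output.strip()),
    }
    checks["pass"] = all(checks.values())
    return checks
```

Because each check is deterministic, this tier can run on every commit in CI with no flakiness beyond the agent's own.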
Four Categories of Success Criteria#
- Outcome: Did the task complete? Is the output correct?
- Process: Did the agent invoke the right skill? Follow intended steps?
- Style: Does output follow conventions?
- Efficiency: No thrashing? Reasonable token usage?
Every criterion must be binary (pass/fail) and programmatically checkable.
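One way to keep all four categories binary and machine-checkable is to express each criterion as a predicate over the agent transcript. The transcript fields and thresholds below are assumptions for illustration:

```python
# Hypothetical transcript record; field names are assumptions.
transcript = {
    "final_answer": "42",
    "tools_called": ["read_file", "run_tests"],
    "output_text": "def add(a, b):\n    return a + b\n",
    "total_tokens": 1800,
}

criteria = {
    # Outcome: did the task produce the correct result?
    "outcome_correct": lambda t: t["final_answer"] == "42",
    # Process: was the intended skill/tool actually invoked?
    "process_used_skill": lambda t: "run_tests" in t["tools_called"],
    # Style: does output follow a convention (here: no tab indentation)?
    "style_no_tabs": lambda t: "\t" not in t["output_text"],
    # Efficiency: did the run stay within a token budget?
    "efficiency_budget": lambda t: t["total_tokens"] <= 5000,
}

results = {name: check(transcript) for name, check in criteria.items()}
```

Each predicate returns a plain boolean, so the same criteria can feed a Tier 1 grader or be aggregated across many runs.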
Test Design#
- 20–50 prompts per skill minimum
- Include negative controls (prompts that should NOT trigger the skill)
- Use pass@k metrics (the probability that at least one of k attempts succeeds); run each prompt 5–10 times minimum
- Feed production failures back into eval sets continuously
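The pass@k numbers above can be computed with the standard unbiased estimator (popularized by the HumanEval benchmark): given c observed successes out of n total runs, pass@k = 1 − C(n−c, k)/C(n, k). A small sketch:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of
    k sampled attempts succeeds, given c successes in n total runs."""
    if n - c < k:
        # Fewer failures than samples drawn: at least one success is certain.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 3 successes in 10 runs, `pass_at_k(10, 3, 1)` is 0.3, while `pass_at_k(10, 3, 5)` rises to about 0.92, which is why running each prompt 5–10 times gives a much more informative picture than a single manual test.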
LLM-as-Judge Pitfalls#
- Position bias (prefers first/last options)
- Verbosity bias (longer = higher scores)
- Self-preference (models prefer own family’s output)
- Inconsistency (same input, different scores across runs)
- Even GPT-4-class judges reach only 70–85% agreement with human evaluators
What’s Missing in the Ecosystem#
The ten-pillars-agentic-skill-design paper explicitly acknowledged “no original controlled study” as a limitation, and the key-insights-agentic-landscape analysis identified evaluation as a top gap. This framework provides the methodology to fill it, but no tool in the wiki has fully implemented it yet.