Use this when you need to evaluate an AI agent's behavior over multi-step, tool-calling runs, not the quality of a single model response. Covers the offline regression suite (a fixed set of representative tasks with checkable success conditions), trajectory evaluation (did the agent take a sane path, call the right tools in a sane order, avoid loops and dangerous actions), the metrics that matter (task success rate, tool-call accuracy, unnecessary-action rate, cost and latency per task), and the online sampling that catches what offline misses. If you are scoring a single LLM call's output quality with a rubric, or deriving a pass/fail threshold, use /ai-eval-design instead. This skill defers single-output rubrics and threshold derivation to that skill and builds on top of them.
Related skills: Single-output rubrics and pass-bar derivation live in /ai-eval-design. Design the agent itself first with /ai-agent-design. The production-monitoring plumbing (logging, sampling pipeline, alerting) is specified in /llm-observability-plan. For agents that hand off to each other, coordinate with /multi-agent-orchestration.
The hard part most teams miss
Most teams test the agent's last message and call it an eval. That measures the answer, not the agent. An agent is a trajectory, and the trajectory is where it goes wrong.
- Trajectory is not output. A correct answer reached through a dangerous or wasteful path is a failure, not a pass. The agent that deleted the staging database, then restored it from backup, and returned the right number still failed the eval. Score the path, not just the destination: which tools were called, in what order, how many steps, what irreversible actions fired along the way. An eval that only checks the final answer will green-light an agent that is one bad run from a disaster.
- Offline suites decay unless a named owner adds new production failures. The regression suite is only as real as the failures it contains, and reality ships new failures every week. Suites do not rot because the method was wrong; they rot because nobody turns last Tuesday's production incident into test case 51. The highest-leverage decision in this whole plan is naming the person who owns that conversion (Step 7), not picking trajectory-match versus LLM-judge.
- Online evals are where reality lives; offline alone gives false confidence. Your offline suite is a model of production, and the model is always wrong at the edges: the inputs you did not imagine, the tool that started timing out, the prompt injection in a real document. A green offline suite proves you did not regress against known cases. It proves nothing about the cases you have not seen. The gap is measurable in practice: observability adoption far outpaces eval adoption (observability around 89 percent, offline evals around 52 percent, online evals around 37 percent), so most teams are either watching production without a regression net or running a net without a way to know it still fits. Evals are the production-readiness signal that predicts which agents survive contact with users.
Process
Step 1: Gather inputs
Ask the user:
- What does the agent do, end to end, and what does "done" look like per task? (You need a concrete, checkable success condition per task, not a vibe. If you cannot state it, you cannot eval it.)
- What tools can it call, and which are irreversible or expensive? (This is the watch-list for trajectory scoring and the unnecessary-action metric.)
- What does a sane path look like for a typical task? (Roughly which tools, in what order. This is the reference trajectory you score against, loosely, not literally.)
- What are the cost and latency budgets per task? (A ceiling in dollars and seconds, or tokens and tool calls. Without a budget, "expensive" is undefined.)
- What real runs do you have to draw from? (Production logs, traces, prior incidents. These seed both the offline suite and the online sampling.)
- Who will own adding new production failures to the suite? (Name a person now. If the answer is "we'll figure it out," the suite is already decaying.)
If the user is still defining single-output quality (what a good answer looks like, what the pass bar is), stop and run /ai-eval-design first. This skill assumes per-task success conditions already exist and evaluates the behavior that reaches them.
Step 2: Build the offline regression suite
A regression suite is a fixed, version-controlled set of representative tasks, each with a deterministic success condition the harness can check without a human in the loop where possible.
For each task, capture:
- Task input: the exact prompt, context, and starting state handed to the agent.
- Success condition: a checkable assertion on the end state (a file exists, a record was created, the returned value matches, the right ticket was closed). Prefer state checks over string matching on the final message.
- Reference trajectory (loose): the tools a sane run would call and a rough order. Used to score path quality, not to demand an exact match.
- Forbidden actions: irreversible or out-of-scope tool calls that fail the task if they fire, regardless of the final answer.
- Source: where the task came from (designed happy path, edge case, or a real production failure).
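The per-task fields above could be captured in a small record type. This is a minimal sketch, not a prescribed schema: the `RegressionTask` class, its field names, the tool names, and the ticket ID `T-100` are all hypothetical, chosen only to mirror the list.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RegressionTask:
    """One offline regression task (hypothetical schema mirroring the list above)."""
    task_id: str
    task_input: dict[str, Any]                           # prompt, context, starting state
    success_condition: Callable[[dict[str, Any]], bool]  # assertion on the end state
    reference_trajectory: tuple[str, ...]                # loose: tool names in rough order
    forbidden_actions: frozenset[str]                    # tools that fail the task if called
    source: str                                          # "happy_path", "edge_case", "production_failure"


def ticket_closed(end_state: dict[str, Any]) -> bool:
    # State check on the end state, not string matching on the final message.
    return end_state.get("tickets", {}).get("T-100") == "closed"


EXAMPLE_TASK = RegressionTask(
    task_id="close-duplicate-ticket",
    task_input={"prompt": "Close ticket T-100 as a duplicate.",
                "state": {"tickets": {"T-100": "open"}}},
    success_condition=ticket_closed,
    reference_trajectory=("tickets.get", "tickets.close"),
    forbidden_actions=frozenset({"tickets.delete"}),
    source="happy_path",
)
```

The success condition is a plain function over the end state, so the harness can evaluate it without a human in the loop.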
Cover four task categories:
| Category | Count | Purpose |
|---|---|---|
| Happy path | 10-20 | Typical, well-formed tasks the agent should nail |
| Multi-step / branching | 8-15 | Tasks that require several tool calls, recovery from a tool error, or a decision between paths |
| Adversarial / trap | 5-10 | Tasks with a tempting wrong path, a prompt injection, or an irreversible action that should not fire |
| Regression | grows | Real production failures, one task per incident, added over time by the owner |
Version the suite alongside the agent's prompts and tool definitions. When the prompt changes, you need to know which suite version set the baseline.
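One way to record which suite version set a baseline is to store content fingerprints next to the metrics. The sketch below assumes the suite, prompts, and tool definitions can all be serialized as JSON; `fingerprint` and `baseline_record` are hypothetical helper names, and a git commit hash would serve the same purpose if everything lives in one repository.

```python
import hashlib
import json
from typing import Any


def fingerprint(obj: Any) -> str:
    """Short, stable content hash of any JSON-serializable object."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def baseline_record(suite_tasks: Any, prompts: Any, tool_definitions: Any,
                    metrics: dict[str, float]) -> dict[str, Any]:
    # Stored with every suite run so a later metric shift can be traced to
    # a change in the suite, the prompts, or the tool definitions.
    return {
        "suite_version": fingerprint(suite_tasks),
        "prompt_version": fingerprint(prompts),
        "tools_version": fingerprint(tool_definitions),
        "metrics": metrics,
    }
```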
Step 3: Define trajectory scoring
Decide how each run is judged on its path, not just its answer. Score on these axes:
| Axis | What it measures | How to check |
|---|---|---|
| Task success | Did the end state meet the success condition? | Deterministic state assertion |
| Tool-call accuracy | Were the right tools called with sane arguments? | Compare against reference trajectory and tool schemas |
| Path sanity | Sane order, no thrash, no loops, no no-progress cycles | Detect repeated identical calls; cap steps; LLM-judge for order |
| Unnecessary actions | Tool calls that did not advance the task | Count calls outside the reference trajectory, excluding justified detours |
| Safety | Did any forbidden or irreversible action fire? | Hard-fail check against the forbidden-actions list |
Two rules keep this honest:
- Safety is a hard fail, outside any weighted score. A run that fires a forbidden action fails, period, even if the final answer is correct and every other axis is green. Do not let an aggregate average wash out a destructive action.
- Trajectory match is loose, not literal. A run that reaches the goal by a different sane path is a pass, not a regression. Score whether the path was reasonable and safe, not whether it matched a golden transcript token for token. Use an LLM-judge for the judgment calls (was this order sane, was this detour justified) and deterministic checks for the rest.
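The deterministic part of this scoring might look like the sketch below. It assumes a trace is a list of tool calls with serialized arguments. `ToolCall`, `RunScore`, `score_run`, and the default thresholds are hypothetical. The sketch counts every call outside the reference trajectory as unnecessary and leaves the "was this detour justified" and "was this order sane" judgments to an LLM-judge, which is not shown.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: str  # serialized arguments, so identical calls compare equal


@dataclass
class RunScore:
    task_success: bool
    safety_ok: bool
    loop_detected: bool
    over_step_cap: bool
    unnecessary_calls: int

    @property
    def passed(self) -> bool:
        # Safety is a hard fail checked on its own, never averaged in.
        if not self.safety_ok:
            return False
        return self.task_success and not self.loop_detected and not self.over_step_cap


def score_run(calls: list[ToolCall], task_success: bool,
              reference_tools: set[str], forbidden_tools: set[str],
              max_repeats: int = 3, step_cap: int = 20) -> RunScore:
    safety_ok = all(c.tool not in forbidden_tools for c in calls)
    # Loop heuristic (an assumption): the same call with the same arguments
    # appearing max_repeats or more times anywhere in the run.
    loop_detected = any(n >= max_repeats for n in Counter(calls).values())
    # Loose match: only tool membership is checked here, not exact order.
    unnecessary = sum(1 for c in calls if c.tool not in reference_tools)
    return RunScore(task_success, safety_ok, loop_detected,
                    len(calls) > step_cap, unnecessary)
```

A run that drops and restores a database but returns the right value comes out with `task_success=True` and `passed=False`, which is the behavior the first rule asks for.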
Step 4: Define the metrics and budgets
Aggregate per-run scores into suite-level metrics, each with a target and a hard limit:
| Metric | Definition | Target | Hard limit |
|---|---|---|---|
| Task success rate | Share of tasks meeting the success condition | (target) | (minimum acceptable) |
| Tool-call accuracy | Share of tool calls that were correct and well-formed | (target) | (minimum acceptable) |
| Unnecessary-action rate | Avg unnecessary tool calls per task | (target) | (maximum acceptable) |
| Safety pass rate | Share of runs with zero forbidden actions | 100% | 100% (any breach blocks ship) |
| Cost per task | Avg dollars or tokens per completed task | (budget) | (ceiling) |
| Latency per task | p50 / p95 wall-clock per task | (target) | (maximum) |
Cost and latency are first-class, not afterthoughts. An agent that passes on quality but burns the budget per task is not shippable, and you only see that if you measure it every run.
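The aggregation from per-run results to the suite-level table can be sketched as follows. `RunResult`, `suite_metrics`, and `ship_blockers` are hypothetical names. The sketch averages cost over all runs (one reading of "per completed task") and uses a nearest-rank percentile, both of which are assumptions to adjust.

```python
import math
from dataclasses import dataclass


@dataclass
class RunResult:
    task_success: bool
    safety_ok: bool
    correct_tool_calls: int
    total_tool_calls: int
    unnecessary_calls: int
    cost_usd: float
    latency_s: float


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suite_metrics(runs: list[RunResult]) -> dict[str, float]:
    if not runs:
        raise ValueError("cannot aggregate an empty suite run")
    n = len(runs)
    total_calls = sum(r.total_tool_calls for r in runs)
    correct_calls = sum(r.correct_tool_calls for r in runs)
    latencies = [r.latency_s for r in runs]
    return {
        "task_success_rate": sum(r.task_success for r in runs) / n,
        "tool_call_accuracy": correct_calls / total_calls if total_calls else 1.0,
        "unnecessary_action_rate": sum(r.unnecessary_calls for r in runs) / n,
        "safety_pass_rate": sum(r.safety_ok for r in runs) / n,
        "cost_per_task": sum(r.cost_usd for r in runs) / n,
        "latency_p50": percentile(latencies, 50),
        "latency_p95": percentile(latencies, 95),
    }


def ship_blockers(metrics: dict[str, float], minimums: dict[str, float],
                  maximums: dict[str, float]) -> list[str]:
    """Names of metrics past their hard limit; any safety breach always blocks."""
    blockers = []
    if metrics["safety_pass_rate"] < 1.0:
        blockers.append("safety_pass_rate")
    blockers += [name for name, floor in minimums.items() if metrics[name] < floor]
    blockers += [name for name, ceiling in maximums.items() if metrics[name] > ceiling]
    return blockers
```

Keeping the safety check outside the `minimums` and `maximums` dictionaries means nobody can relax it by editing a threshold.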
Step 5: Add online / production evals
Offline proves no regression against known cases. Online catches the unknown. Specify:
- Sampling: what fraction of production runs to score (start with a fixed daily sample plus 100 percent of runs that hit an error or a forbidden-action guard).
- Scoring in production: the same trajectory axes from Step 3, run on sampled real traces. Deterministic checks where possible; LLM-judge for path sanity.
- Regression detection: alert when task success rate, safety pass rate, or unnecessary-action rate drifts past its hard limit over a rolling window, not on a single bad run.
- Capture for the suite: every production failure is logged with enough context (input, full trace, end state) to become a regression task in Step 2.
The logging, trace storage, sampling pipeline, and alerting channels are infrastructure. Specify the eval logic here and point at /llm-observability-plan for the plumbing rather than redesigning it.
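The eval logic for sampling and regression detection can be sketched without any of that plumbing. The names `should_score` and `RollingRate`, the 5 percent default rate, and the hash-based sampling are assumptions. A fixed daily sample would need a counter in place of the hash bucket.

```python
import hashlib
from collections import deque


def should_score(run_id: str, had_error: bool, hit_forbidden_guard: bool,
                 sample_rate: float = 0.05) -> bool:
    """Always score error and guard-hit runs; sample the rest deterministically."""
    if had_error or hit_forbidden_guard:
        return True
    # Hashing the run ID means the same run always gets the same decision.
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


class RollingRate:
    """Tracks a pass rate over the last `window` scored runs."""

    def __init__(self, window: int, min_rate: float, min_samples: int) -> None:
        self.outcomes: deque[bool] = deque(maxlen=window)
        self.min_rate = min_rate
        self.min_samples = min_samples

    def record(self, ok: bool) -> bool:
        """Add one outcome; return True if the rolling rate breaches the limit."""
        self.outcomes.append(ok)
        if len(self.outcomes) < self.min_samples:
            return False  # too little data: never alert on a single bad run
        return sum(self.outcomes) / len(self.outcomes) < self.min_rate
```

One `RollingRate` per metric (task success, safety pass, and so on) keeps each hard limit independent. A lower-is-better metric such as unnecessary-action rate would need the comparison inverted.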
Step 6: Output the harness design
# Agent Eval Harness: {{agent name}}
**Agent job:** {{one sentence}}
**Per-task success condition style:** {{state check / value match / external assertion}}
**Owner of new regression cases:** {{named person}}
## Offline regression suite
| Category | Count | Notes |
|---|---|---|
| Happy path | | |
| Multi-step / branching | | |
| Adversarial / trap | | |
| Regression (grows) | | |
- Storage / versioning: {{where the suite lives, how it is versioned}}
- Per-task fields captured: input, success condition, loose reference trajectory, forbidden actions, source
## Trajectory scoring
| Axis | Check method | Hard-fail? |
|---|---|---|
| Task success | | |
| Tool-call accuracy | | |
| Path sanity | | |
| Unnecessary actions | | |
| Safety | | yes |
- Judge model (for path-sanity calls): {{model}}
- Loop / no-progress detection: {{mechanism, step cap}}
## Metrics and budgets
| Metric | Target | Hard limit |
|---|---|---|
| Task success rate | | |
| Tool-call accuracy | | |
| Unnecessary-action rate | | |
| Safety pass rate | 100% | 100% |
| Cost per task | | |
| Latency per task (p50 / p95) | | |
## Online evals
- Sampling: {{fraction + always-score conditions}}
- Production scoring: {{which axes, deterministic vs judge}}
- Regression alert: {{metric + window + threshold}}
- Failure capture: {{how a prod failure becomes a regression task}}
- Observability plumbing: see /llm-observability-plan
## Ownership and cadence
- New-failure owner: {{name}}
- Suite run triggers: pre-deploy (full suite), nightly (sample), on incident (add case + re-run)
## Open questions
- {{unresolved decisions}}
Step 7: Review
Ask the user:
- Does any task pass on its final answer while taking a path you would not want in production? If so, the trajectory scoring is too loose.
- Is every irreversible tool call on the forbidden-actions list for the tasks where it should not fire?
- Who, by name, turns the next production failure into the next regression case, and when? If there is no name, the suite will decay.
- Are the cost and latency hard limits real ceilings you would block a ship on, or aspirational numbers?
- Does the online sample plus always-score conditions actually catch the failure types you are most afraid of?
Anti-patterns
| Anti-pattern | Why it fails | Do instead |
|---|---|---|
| Scoring only the final answer | Green-lights agents that reach the goal through dangerous or wasteful paths | Score the trajectory: tools, order, steps, forbidden actions |
| Exact-match trajectory checks | Flags every sane alternate path as a regression; suite becomes noise | Score path sanity loosely with an LLM-judge, keep deterministic checks for state |
| Offline suite, no online evals | Proves no regression against imagined cases, blind to real-world failures | Sample production and feed failures back into the suite |
| No named suite owner | Suite stops resembling reality as new failures never get added | Name one person who converts production failures to test cases |
| Ignoring cost and latency | Ships an agent that passes on quality but burns the budget per task | Make cost and latency per task first-class metrics with hard limits |
| Folding safety into an average | One catastrophic action gets washed out by a high aggregate score | Keep safety as a hard fail outside any weighted score |
Output location
Present the harness design as formatted text in the conversation for the user to copy into their design doc.