
Agent Eval Harness

Design an eval harness for AI agent trajectories: offline regression suites, trajectory scoring, and production sampling.

Use this when you need to evaluate an AI agent's behavior over multi-step, tool-calling runs, not the quality of a single model response. It covers the offline regression suite (a fixed set of representative tasks with checkable success conditions), trajectory evaluation (did the agent take a sane path, call the right tools in a sane order, and avoid loops and dangerous actions), the metrics that matter (task success rate, tool-call accuracy, unnecessary-action rate, cost and latency per task), and the online sampling that catches what offline misses. If you are scoring a single LLM call's output quality with a rubric, or deriving a pass/fail threshold, use /ai-eval-design instead. This skill defers single-output rubrics and threshold derivation to that skill and builds on top of them.

Related skills: Single-output rubrics and pass-bar derivation live in /ai-eval-design. Design the agent itself first with /ai-agent-design. The production-monitoring plumbing (logging, sampling pipeline, alerting) is specified in /llm-observability-plan. For agents that hand off to each other, coordinate with /multi-agent-orchestration.

The hard part most teams miss

Most teams test the agent's last message and call it an eval. That measures the answer, not the agent. An agent is a trajectory, and the trajectory is where it goes wrong.

  1. Trajectory is not output. A correct answer reached through a dangerous or wasteful path is a failure, not a pass. The agent that deleted the staging database, then restored it from backup, and returned the right number still failed the eval. Score the path, not just the destination: which tools were called, in what order, how many steps, what irreversible actions fired along the way. An eval that only checks the final answer will green-light an agent that is one bad run from a disaster.

  2. Offline suites decay unless a named owner adds new production failures. The regression suite is only as real as the failures it contains, and reality ships new failures every week. Suites do not rot because the method was wrong; they rot because nobody turns last Tuesday's production incident into test case 51. The highest-leverage decision in this whole plan is naming the person who owns that conversion (asked in Step 1, checked again in Step 7), not picking trajectory-match versus LLM-judge.

  3. Online evals are where reality lives; offline alone gives false confidence. Your offline suite is a model of production, and the model is always wrong at the edges: the inputs you did not imagine, the tool that started timing out, the prompt injection in a real document. A green offline suite proves you did not regress against known cases. It proves nothing about the cases you have not seen. The gap is measurable in practice: observability adoption far outpaces eval adoption (observability around 89 percent, offline evals around 52 percent, online evals around 37 percent), so most teams are either watching production without a regression net or keeping a net with no way to know it still fits. Evals are the production-readiness signal that predicts which agents survive contact with users.

Process

Step 1: Gather inputs

Ask the user:

  1. What does the agent do, end to end, and what does "done" look like per task? (You need a concrete, checkable success condition per task, not a vibe. If you cannot state it, you cannot eval it.)
  2. What tools can it call, and which are irreversible or expensive? (This is the watch-list for trajectory scoring and the unnecessary-action metric.)
  3. What does a sane path look like for a typical task? (Roughly which tools, in what order. This is the reference trajectory you score against, loosely, not literally.)
  4. What are the cost and latency budgets per task? (A ceiling in dollars and seconds, or tokens and tool calls. Without a budget, "expensive" is undefined.)
  5. What real runs do you have to draw from? (Production logs, traces, prior incidents. These seed both the offline suite and the online sampling.)
  6. Who will own adding new production failures to the suite? (Name a person now. If the answer is "we'll figure it out," the suite is already decaying.)

If the user is still defining single-output quality (what a good answer looks like, what the pass bar is), stop and run /ai-eval-design first. This skill assumes per-task success conditions already exist and evaluates the behavior that reaches them.

Step 2: Build the offline regression suite

A regression suite is a fixed, version-controlled set of representative tasks. Each task has a success condition the harness can check deterministically, without a human in the loop wherever possible.

For each task, capture:

  • Task input: the exact prompt, context, and starting state handed to the agent.
  • Success condition: a checkable assertion on the end state (a file exists, a record was created, the returned value matches, the right ticket was closed). Prefer state checks over string matching on the final message.
  • Reference trajectory (loose): the tools a sane run would call and a rough order. Used to score path quality, not to demand an exact match.
  • Forbidden actions: irreversible or out-of-scope tool calls that fail the task if they fire, regardless of the final answer.
  • Source: where the task came from (designed happy path, edge case, or a real production failure).
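
The five fields above map naturally onto one record per task. The following is a minimal sketch, not a prescribed schema: `RegressionTask`, its field names, and the ticket example (including the tool names `get_ticket`, `close_ticket`, and `delete_ticket`) are all hypothetical and exist only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class RegressionTask:
    """One suite entry. All names here are illustrative assumptions."""
    task_id: str
    task_input: dict[str, Any]                       # prompt, context, starting state
    success_condition: Callable[[dict[str, Any]], bool]  # assertion on the end state
    reference_tools: tuple[str, ...]                 # loose reference trajectory
    forbidden_tools: frozenset[str]                  # fail the task if any of these fire
    source: str                                      # "happy_path", "edge_case", "production_failure"


def duplicate_closed(end_state: dict[str, Any]) -> bool:
    # State check, not a string match on the agent's final message.
    open_tickets = end_state["open_tickets"]
    return 123 not in open_tickets and 456 in open_tickets


close_duplicate = RegressionTask(
    task_id="close-duplicate-ticket",
    task_input={
        "prompt": "Close ticket 123 as a duplicate of 456.",
        "open_tickets": [123, 456],
    },
    success_condition=duplicate_closed,
    reference_tools=("get_ticket", "close_ticket"),
    forbidden_tools=frozenset({"delete_ticket"}),
    source="happy_path",
)
```

Keeping the success condition as a function over the end state makes the "prefer state checks" rule hard to skip by accident.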

Cover four task categories:

| Category | Count | Purpose |
|---|---|---|
| Happy path | 10-20 | Typical, well-formed tasks the agent should nail |
| Multi-step / branching | 8-15 | Tasks that require several tool calls, recovery from a tool error, or a decision between paths |
| Adversarial / trap | 5-10 | Tasks with a tempting wrong path, a prompt injection, or an irreversible action that should not fire |
| Regression | grows | Real production failures, one task per incident, added over time by the owner |

Version the suite alongside the agent's prompts and tool definitions. When the prompt changes, you need to know which suite version set the baseline.
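
One way to tie a baseline to the exact prompt, tools, and suite that produced it is to store a fingerprint next to the results. This is a sketch under assumptions: `baseline_key` and its parameters are hypothetical, and a real setup might rely on a git commit hash instead.

```python
import hashlib
import json
from typing import Any


def baseline_key(suite_version: str, system_prompt: str,
                 tool_definitions: list[dict[str, Any]]) -> str:
    """Short fingerprint of everything that defines a baseline run."""
    payload = json.dumps(
        {"suite": suite_version, "prompt": system_prompt, "tools": tool_definitions},
        sort_keys=True,  # stable key order so equal inputs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


tools = [{"name": "close_ticket", "parameters": {"ticket_id": "integer"}}]
key_before = baseline_key("suite-v7", "You are a support agent.", tools)
key_after = baseline_key("suite-v7", "You are a careful support agent.", tools)
```

A prompt edit changes the key, so results recorded under different keys are never compared as if they shared a baseline.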

Step 3: Define trajectory scoring

Decide how each run is judged on its path, not just its answer. Score on these axes:

| Axis | What it measures | How to check |
|---|---|---|
| Task success | Did the end state meet the success condition? | Deterministic state assertion |
| Tool-call accuracy | Were the right tools called with sane arguments? | Compare against reference trajectory and tool schemas |
| Path sanity | Sane order, no thrash, no loops, no no-progress cycles | Detect repeated identical calls; cap steps; LLM-judge for order |
| Unnecessary actions | Tool calls that did not advance the task | Count calls outside the reference trajectory minus justified detours |
| Safety | Did any forbidden or irreversible action fire? | Hard-fail check against the forbidden-actions list |
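
The deterministic half of the path-sanity check (repeated identical calls and a step cap) can be sketched in a few lines. The function name `find_loop`, the `(tool_name, arguments)` tuple shape, and the default thresholds are assumptions chosen for illustration, not values from this skill.

```python
from __future__ import annotations

from typing import Any

ToolCall = tuple[str, dict[str, Any]]  # (tool name, arguments); hypothetical shape


def find_loop(calls: list[ToolCall], max_repeats: int = 3,
              step_cap: int = 25) -> str | None:
    """Return a reason string if the trajectory looks stuck, else None."""
    if len(calls) > step_cap:
        return f"step cap exceeded: {len(calls)} > {step_cap}"
    streak = 1
    for prev, cur in zip(calls, calls[1:]):
        # Same tool with the same arguments, back to back, extends the streak.
        streak = streak + 1 if prev == cur else 1
        if streak >= max_repeats:
            return f"{cur[0]} repeated {streak} times in a row with identical arguments"
    return None
```

This only catches back-to-back repeats. Alternating cycles (A, B, A, B) and "was this order sane" still need a wider window or the LLM-judge.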

Two rules keep this honest:

  • Safety is a hard fail, outside any weighted score. A run that fires a forbidden action fails, period, even if the final answer is correct and every other axis is green. Do not let an aggregate average wash out a destructive action.
  • Trajectory match is loose, not literal. A run that reaches the goal by a different sane path is a pass, not a regression. Score whether the path was reasonable and safe, not whether it matched a golden transcript token for token. Use an LLM-judge for the judgment calls (was this order sane, was this detour justified) and deterministic checks for the rest.
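
Both rules can be made structural rather than left to convention. In the sketch below, safety is checked first and short-circuits to a fail, while calls outside the reference trajectory are counted but never fail the run on their own. `RunScore`, `score_run`, and the tool names are hypothetical, and the `justified_detours` set stands in for whatever the LLM-judge decides.

```python
from dataclasses import dataclass


@dataclass
class RunScore:
    passed: bool
    safety_breach: bool
    task_success: bool
    unnecessary_calls: int
    reason: str


def score_run(tools_called, reference_tools, forbidden_tools, task_success,
              justified_detours=frozenset()):
    """Score one run. Safety is a hard fail, never a weighted term."""
    breaches = [t for t in tools_called if t in forbidden_tools]
    extra = [t for t in tools_called
             if t not in reference_tools and t not in justified_detours]
    if breaches:
        # A correct final answer does not rescue a run that fired a forbidden action.
        return RunScore(False, True, task_success, len(extra),
                        f"forbidden action fired: {breaches[0]}")
    if not task_success:
        return RunScore(False, False, False, len(extra), "success condition not met")
    # A different but sane path still passes; extra calls feed the
    # unnecessary-action metric instead of failing the run.
    return RunScore(True, False, True, len(extra), "ok")


# The run from the intro: right answer, but the database was dropped on the way.
disaster = score_run(["drop_db", "restore_db", "query"], {"query"}, {"drop_db"}, True)
alternate = score_run(["search", "query"], {"lookup", "query"}, set(), True)
```

Here `disaster.passed` is false despite the successful end state, and `alternate.passed` is true with one unnecessary call recorded.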

Step 4: Define the metrics and budgets

Aggregate per-run scores into suite-level metrics, each with a target and a hard limit:

| Metric | Definition | Target | Hard limit |
|---|---|---|---|
| Task success rate | Share of tasks meeting the success condition | (target) | (minimum acceptable) |
| Tool-call accuracy | Share of tool calls that were correct and well-formed | (target) | (minimum acceptable) |
| Unnecessary-action rate | Avg unnecessary tool calls per task | (target) | (maximum acceptable) |
| Safety pass rate | Share of runs with zero forbidden actions | 100% | 100% (any breach blocks ship) |
| Cost per task | Avg dollars or tokens per completed task | (budget) | (ceiling) |
| Latency per task | p50 / p95 wall-clock per task | (target) | (maximum) |

Cost and latency are first-class, not afterthoughts. An agent that passes on quality but burns the budget per task is not shippable, and you only see that if you measure it every run.
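
The table above can be computed directly from per-run records. The sketch below is one possible aggregation, with several assumptions: `RunResult`, `suite_metrics`, and `ship_blockers` are hypothetical names, cost is averaged over all runs rather than only completed ones, p95 uses the nearest-rank method, and the limits in the example are made-up placeholders.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class RunResult:
    task_success: bool
    safety_breach: bool
    tool_calls: int
    correct_tool_calls: int
    unnecessary_calls: int
    cost_usd: float
    latency_s: float


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile on an already sorted list (pct as an integer 1-100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def suite_metrics(runs):
    n = len(runs)
    latencies = sorted(r.latency_s for r in runs)
    total_calls = sum(r.tool_calls for r in runs)
    return {
        "task_success_rate": sum(r.task_success for r in runs) / n,
        "tool_call_accuracy": (sum(r.correct_tool_calls for r in runs) / total_calls
                               if total_calls else 1.0),
        "unnecessary_action_rate": sum(r.unnecessary_calls for r in runs) / n,
        "safety_pass_rate": sum(not r.safety_breach for r in runs) / n,
        "cost_per_task_usd": sum(r.cost_usd for r in runs) / n,  # assumption: all runs
        "latency_p50_s": statistics.median(latencies),
        "latency_p95_s": nearest_rank(latencies, 95),
    }


def ship_blockers(metrics, hard_limits):
    """hard_limits maps metric name -> ("min" | "max", limit)."""
    blockers = []
    for name, (kind, limit) in hard_limits.items():
        value = metrics[name]
        if (value < limit) if kind == "min" else (value > limit):
            blockers.append(f"{name}={value:.3g} breaches {kind} {limit}")
    return blockers


runs = [
    RunResult(True, False, 4, 4, 0, 0.10, 2.0),
    RunResult(True, False, 5, 4, 1, 0.20, 4.0),
    RunResult(False, False, 6, 3, 2, 0.30, 6.0),
    RunResult(True, False, 5, 5, 1, 0.20, 8.0),
]
metrics = suite_metrics(runs)
blockers = ship_blockers(metrics, {
    "task_success_rate": ("min", 0.8),   # placeholder limits, not recommendations
    "safety_pass_rate": ("min", 1.0),
    "cost_per_task_usd": ("max", 0.25),
})
```

Computing cost and latency in the same pass as quality is what keeps them first-class: a budget breach shows up in the same blocker list as a success-rate breach.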

Step 5: Add online / production evals

Offline proves no regression against known cases. Online catches the unknown. Specify:

  • Sampling: what fraction of production runs to score (start with a fixed daily sample plus 100 percent of runs that hit an error or a forbidden-action guard).
  • Scoring in production: the same trajectory axes from Step 3, run on sampled real traces. Deterministic checks where possible; LLM-judge for path sanity.
  • Regression detection: alert when task success rate, safety pass rate, or unnecessary-action rate drifts past its hard limit over a rolling window, not on a single bad run.
  • Capture for the suite: every production failure is logged with enough context (input, full trace, end state) to become a regression task in Step 2.
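
The sampling and regression-detection bullets above can be sketched as two small pieces of eval logic. Everything named here is an assumption for illustration: `should_score` uses a hash of the run ID to pick a stable fraction (the skill itself only says "a fixed daily sample"), and `RollingRate` alerts on a windowed mean rather than on any single run. The rates and window sizes are placeholders.

```python
import hashlib
from collections import deque


def should_score(run_id: str, had_error: bool, hit_forbidden_guard: bool,
                 sample_rate: float = 0.05) -> bool:
    """Always score error and guard-hit runs; sample the rest deterministically."""
    if had_error or hit_forbidden_guard:
        return True
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)     # same run ID always gets the same answer


class RollingRate:
    """Tracks a metric over the last `window` scored runs."""

    def __init__(self, window: int, hard_limit: float, kind: str = "min"):
        self.values = deque(maxlen=window)
        self.hard_limit = hard_limit
        self.kind = kind  # "min": alert when mean falls below; "max": when it rises above

    def add(self, value: float) -> bool:
        """Record one run; return True if the windowed mean breaches the hard limit."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # window not full yet, so a single bad run cannot alert
        mean = sum(self.values) / len(self.values)
        return mean < self.hard_limit if self.kind == "min" else mean > self.hard_limit


success = RollingRate(window=4, hard_limit=0.75)
alerts = [success.add(v) for v in (1, 1, 0, 1, 0)]
```

In this example the first four runs average exactly 0.75 and do not alert; the fifth pushes the window mean to 0.5 and does. Where the traces come from and where the alert goes stay with the observability plumbing.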

The logging, trace storage, sampling pipeline, and alerting channels are infrastructure. Specify the eval logic here and point at /llm-observability-plan for the plumbing rather than redesigning it.

Step 6: Output the harness design

# Agent Eval Harness: {{agent name}}

**Agent job:** {{one sentence}}
**Per-task success condition style:** {{state check / value match / external assertion}}
**Owner of new regression cases:** {{named person}}

## Offline regression suite
| Category | Count | Notes |
|---|---|---|
| Happy path | | |
| Multi-step / branching | | |
| Adversarial / trap | | |
| Regression (grows) | | |
- Storage / versioning: {{where the suite lives, how it is versioned}}
- Per-task fields captured: input, success condition, loose reference trajectory, forbidden actions, source

## Trajectory scoring
| Axis | Check method | Hard-fail? |
|---|---|---|
| Task success | | |
| Tool-call accuracy | | |
| Path sanity | | |
| Unnecessary actions | | |
| Safety | | yes |
- Judge model (for path-sanity calls): {{model}}
- Loop / no-progress detection: {{mechanism, step cap}}

## Metrics and budgets
| Metric | Target | Hard limit |
|---|---|---|
| Task success rate | | |
| Tool-call accuracy | | |
| Unnecessary-action rate | | |
| Safety pass rate | 100% | 100% |
| Cost per task | | |
| Latency per task (p50 / p95) | | |

## Online evals
- Sampling: {{fraction + always-score conditions}}
- Production scoring: {{which axes, deterministic vs judge}}
- Regression alert: {{metric + window + threshold}}
- Failure capture: {{how a prod failure becomes a regression task}}
- Observability plumbing: see /llm-observability-plan

## Ownership and cadence
- New-failure owner: {{name}}
- Suite run triggers: pre-deploy (full suite), nightly (sample), on incident (add case + re-run)

## Open questions
- {{unresolved decisions}}

Step 7: Review

Ask the user:

  • Does any task pass on its final answer while taking a path you would not want in production? If so, the trajectory scoring is too loose.
  • Is every irreversible tool call on the forbidden-actions list for the tasks where it should not fire?
  • Who, by name, turns the next production failure into the next regression case, and when? If there is no name, the suite will decay.
  • Are the cost and latency hard limits real ceilings you would block a ship on, or aspirational numbers?
  • Does the online sample plus always-score conditions actually catch the failure types you are most afraid of?

Anti-patterns

| Anti-pattern | Why it fails | Do instead |
|---|---|---|
| Scoring only the final answer | Green-lights agents that reach the goal through dangerous or wasteful paths | Score the trajectory: tools, order, steps, forbidden actions |
| Exact-match trajectory checks | Flags every sane alternate path as a regression; suite becomes noise | Score path sanity loosely with an LLM-judge, keep deterministic checks for state |
| Offline suite, no online evals | Proves no regression against imagined cases, blind to real-world failures | Sample production and feed failures back into the suite |
| No named suite owner | Suite stops resembling reality as new failures never get added | Name one person who converts production failures to test cases |
| Ignoring cost and latency | Ships an agent that passes on quality but burns the budget per task | Make cost and latency per task first-class metrics with hard limits |
| Folding safety into an average | One catastrophic action gets washed out by a high aggregate score | Keep safety as a hard fail outside any weighted score |

Output location

Present the harness design as formatted text in the conversation for the user to copy into their design doc.