Agent Eval Harness

Use this when you need to evaluate an AI agent's behavior over multi-step, tool-calling runs, not the quality of a single model response. It covers the offline regression suite (a fixed set of representative tasks with checkable success conditions), trajectory evaluation (whether the agent took a sane path, called the right tools in a sane order, and avoided loops and dangerous actions), the metrics that matter (task success rate, tool-call accuracy, unnecessary-action rate, cost and latency per task), and the online sampling that catches what offline misses. If you are scoring a single LLM call's output quality with a rubric, or deriving a pass/fail threshold, use /ai-eval-design instead. This skill defers single-output rubrics and threshold derivation to that skill and builds on top of them.

Related skills: Single-output rubrics and pass-bar derivation live in /ai-eval-design. Design the agent itself first with /ai-agent-design. The production-monitoring plumbing (logging, sampling pipeline, alerting) is specified in /llm-observability-plan. For agents that hand off to each other, coordinate with /multi-agent-orchestration.

The hard part most teams miss

Most teams test the agent's last message and call it an eval. That measures the answer, not the agent. An agent is a trajectory, and the trajectory is where it goes wrong.

Trajectory is not output. A correct answer reached through a dangerous or wasteful path is a failure, not a pass. The agent that deleted the staging database, then restored it from backup, and returned the right number still failed the eval. Score the path, not just the destination: which tools were called, in what order, how many steps, what irreversible actions fired along the way. An eval that only checks the final answer will green-light an agent that is one bad run from a disaster.
Offline suites decay unless a named owner adds new production failures. The regression suite is only as real as the failures it contains, and reality ships new failures every week. Suites do not rot because the method was wrong; they rot because nobody turns last Tuesday's production incident into test case 51. The highest-leverage decision in this whole plan is naming the person who owns that conversion (Step 7), not picking trajectory-match versus LLM-judge.
Online evals are where reality lives; offline alone gives false confidence. Your offline suite is a model of production, and the model is always wrong at the edges: the inputs you did not imagine, the tool that started timing out, the prompt injection in a real document. A green offline suite proves you did not regress against known cases. It proves nothing about the cases you have not seen. The gap is measurable in practice: observability adoption far outpaces eval adoption (observability around 89 percent, offline evals around 52 percent, online evals around 37 percent), so most teams are either watching production without a regression net, or keeping a net with no way to know it still fits. Evals are the production-readiness signal that predicts which agents survive contact with users.

Process

Step 1: Gather inputs

Ask the user:

What does the agent do, end to end, and what does "done" look like per task? (You need a concrete, checkable success condition per task, not a vibe. If you cannot state it, you cannot eval it.)
What tools can it call, and which are irreversible or expensive? (This is the watch-list for trajectory scoring and the unnecessary-action metric.)
What does a sane path look like for a typical task? (Roughly which tools, in what order. This is the reference trajectory you score against, loosely, not literally.)
What are the cost and latency budgets per task? (A ceiling in dollars and seconds, or tokens and tool calls. Without a budget, "expensive" is undefined.)
What real runs do you have to draw from? (Production logs, traces, prior incidents. These seed both the offline suite and the online sampling.)
Who will own adding new production failures to the suite? (Name a person now. If the answer is "we'll figure it out," the suite is already decaying.)

If the user is still defining single-output quality (what a good answer looks like, what the pass bar is), stop and run /ai-eval-design first. This skill assumes per-task success conditions already exist and evaluates the behavior that reaches them.

Step 2: Build the offline regression suite

A regression suite is a fixed, version-controlled set of representative tasks, each with a deterministic success condition that the harness can check, where possible, without a human in the loop.

For each task, capture:

Task input: the exact prompt, context, and starting state handed to the agent.
Success condition: a checkable assertion on the end state (a file exists, a record was created, the returned value matches, the right ticket was closed). Prefer state checks over string matching on the final message.
Reference trajectory (loose): the tools a sane run would call and a rough order. Used to score path quality, not to demand an exact match.
Forbidden actions: irreversible or out-of-scope tool calls that fail the task if they fire, regardless of the final answer.
Source: where the task came from (designed happy path, edge case, or a real production failure).
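
The fields above can be captured in one record per task. This is a minimal sketch, not a prescribed schema: the class name `RegressionTask`, the field names, the ticket IDs, and the tool names (`get_ticket`, `close_ticket`, `delete_ticket`) are all hypothetical and chosen only for illustration. It shows the success condition as a function over the end state rather than a string match on the final message.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RegressionTask:
    """One offline regression task (illustrative schema, not a standard)."""
    task_id: str
    task_input: str                     # exact prompt and context handed to the agent
    starting_state: dict[str, Any]      # state of the world before the run
    success_condition: Callable[[dict[str, Any]], bool]  # assertion on the end state
    reference_trajectory: list[str]     # loose: tools a sane run would call, rough order
    forbidden_actions: set[str]         # tool calls that fail the task if they fire
    source: str                         # "happy_path", "edge_case", or "production_failure"


def ticket_is_closed(end_state: dict[str, Any]) -> bool:
    # State check: inspect the resulting record, not the agent's final message.
    return end_state.get("tickets", {}).get("T-1") == "closed"


# Hypothetical example task; every name and value here is made up.
example_task = RegressionTask(
    task_id="close-duplicate-ticket",
    task_input="Close ticket T-1 as a duplicate of T-2.",
    starting_state={"tickets": {"T-1": "open", "T-2": "open"}},
    success_condition=ticket_is_closed,
    reference_trajectory=["get_ticket", "close_ticket"],
    forbidden_actions={"delete_ticket"},
    source="happy_path",
)
```

Keeping the success condition as a callable over state makes it easy to run the same check against sampled production traces later.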

Cover four task categories:

| Category | Count | Purpose |
|---|---|---|
| Happy path | 10-20 | Typical, well-formed tasks the agent should nail |
| Multi-step / branching | 8-15 | Tasks that require several tool calls, recovery from a tool error, or a decision between paths |
| Adversarial / trap | 5-10 | Tasks with a tempting wrong path, a prompt injection, or an irreversible action that should not fire |
| Regression | grows | Real production failures, one task per incident, added over time by the owner |

Version the suite alongside the agent's prompts and tool definitions. When the prompt changes, you need to know which suite version set the baseline.
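
One possible way to tie a baseline to the exact suite, prompt, and tool definitions that produced it is a content fingerprint stored next to the results. The sketch below is an assumption about how a team might do this, not a required mechanism; the function name `baseline_fingerprint` and the shape of its inputs are hypothetical.

```python
import hashlib
import json


def baseline_fingerprint(suite_tasks: list, system_prompt: str, tool_definitions: list) -> str:
    """Return a short, stable ID for one (suite, prompt, tools) combination.

    Inputs must be JSON-serializable. sort_keys=True makes the hash
    independent of dictionary key order.
    """
    payload = json.dumps(
        {"suite": suite_tasks, "prompt": system_prompt, "tools": tool_definitions},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical usage: record this ID alongside every suite result.
tasks = [{"task_id": "close-duplicate-ticket", "input": "Close ticket T-1."}]
tools = [{"name": "close_ticket", "parameters": {"ticket_id": "string"}}]
before = baseline_fingerprint(tasks, "You are a support agent.", tools)
after = baseline_fingerprint(tasks, "You are a careful support agent.", tools)
```

Here `before` and `after` differ because the prompt changed, which signals that results from the two runs should not be compared as if they shared a baseline.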

Step 3: Define trajectory scoring

Decide how each run is judged on its path, not just its answer. Score on these axes:

| Axis | What it measures | How to check |
|---|---|---|
| Task success | Did the end state meet the success condition? | Deterministic state assertion |
| Tool-call accuracy | Were the right tools called with sane arguments? | Compare against reference trajectory and tool schemas |
| Path sanity | Sane order, no thrash, no loops, no no-progress cycles | Detect repeated identical calls; cap steps; LLM-judge for order |
| Unnecessary actions | Tool calls that did not advance the task | Count calls outside the reference trajectory minus justified detours |
| Safety | Did any forbidden or irreversible action fire? | Hard-fail check against the forbidden-actions list |
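
The deterministic part of the path-sanity check (repeated identical calls and a step cap) can be sketched as follows. The function name `find_loop_problems`, the default thresholds, and the tool names are assumptions for illustration; judging whether the order was sane is left to the LLM-judge and is not shown.

```python
import json
from collections import Counter
from typing import Any


def find_loop_problems(
    calls: list[tuple[str, dict[str, Any]]],
    max_steps: int = 20,      # assumed step cap; tune per agent
    max_repeats: int = 2,     # assumed tolerance for identical calls
) -> list[str]:
    """Flag runs that exceed the step cap or repeat the same call too often."""
    problems = []
    if len(calls) > max_steps:
        problems.append(f"step cap exceeded: {len(calls)} > {max_steps}")
    # Two calls are "identical" when tool name and arguments both match.
    counts = Counter((name, json.dumps(args, sort_keys=True)) for name, args in calls)
    for (name, _), n in counts.items():
        if n > max_repeats:
            problems.append(f"repeated identical call: {name} x{n}")
    return problems


# Hypothetical trace: the agent searched for the same thing three times.
trace = [("search", {"q": "refund policy"})] * 3 + [("read", {"id": 1})]
problems = find_loop_problems(trace)
```

An empty list means no loop signal was found, which is not the same as a sane path; it only clears the cheapest check.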

Two rules keep this honest:

Safety is a hard fail, outside any weighted score. A run that fires a forbidden action fails, period, even if the final answer is correct and every other axis is green. Do not let an aggregate average wash out a destructive action.
Trajectory match is loose, not literal. A run that reaches the goal by a different sane path is a pass, not a regression. Score whether the path was reasonable and safe, not whether it matched a golden transcript token for token. Use an LLM-judge for the judgment calls (was this order sane, was this detour justified) and deterministic checks for the rest.
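
A minimal sketch of how the two rules might combine into a per-run verdict, assuming the tools called are available as a list of names. `RunVerdict`, `score_run`, and the tool names are hypothetical. The safety check sits outside any weighted score, and the reference trajectory is only used to count unnecessary calls, never to demand an exact match; the LLM-judge step that decides which detours were justified is assumed to have run already.

```python
from dataclasses import dataclass


@dataclass
class RunVerdict:
    passed: bool
    safety_breach: bool
    unnecessary_calls: int


def score_run(
    called_tools: list[str],
    reference_tools: list[str],
    forbidden: set[str],
    task_succeeded: bool,
    justified_detours: frozenset = frozenset(),  # assumed output of the judge step
) -> RunVerdict:
    # Rule 1: any forbidden action fails the run, whatever the other axes say.
    safety_breach = any(tool in forbidden for tool in called_tools)
    # Rule 2: loose match. Calls outside the reference are counted, not failed,
    # and justified detours are not counted at all.
    reference = set(reference_tools)
    unnecessary = sum(
        1 for tool in called_tools
        if tool not in reference and tool not in justified_detours
    )
    return RunVerdict(
        passed=task_succeeded and not safety_breach,
        safety_breach=safety_breach,
        unnecessary_calls=unnecessary,
    )


# The "right answer, destructive path" case: still a failure.
destructive = score_run(
    called_tools=["drop_database", "restore_backup", "run_query"],
    reference_tools=["run_query"],
    forbidden={"drop_database"},
    task_succeeded=True,
)

# A different but sane path: a pass, with the detour accepted.
detour = score_run(
    called_tools=["list_tables", "run_query"],
    reference_tools=["run_query"],
    forbidden={"drop_database"},
    task_succeeded=True,
    justified_detours=frozenset({"list_tables"}),
)
```

The unnecessary-call count is reported separately so it can feed the suite-level rate without affecting the pass decision for a single run.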

Step 4: Define the metrics and budgets

Aggregate per-run scores into suite-level metrics, each with a target and a hard limit:

| Metric | Definition | Target | Hard limit |
|---|---|---|---|
| Task success rate | Share of tasks meeting the success condition | (target) | (minimum acceptable) |
| Tool-call accuracy | Share of tool calls that were correct and well-formed | (target) | (minimum acceptable) |
| Unnecessary-action rate | Avg unnecessary tool calls per task | (target) | (maximum acceptable) |
| Safety pass rate | Share of runs with zero forbidden actions | 100% | 100% (any breach blocks ship) |
| Cost per task | Avg dollars or tokens per completed task | (budget) | (ceiling) |
| Latency per task | p50 / p95 wall-clock per task | (target) | (maximum) |

Cost and latency are first-class, not afterthoughts. An agent that passes on quality but burns the budget per task is not shippable, and you only see that if you measure it every run.
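
Aggregating per-run results into the suite-level metrics might look like the sketch below. The `RunResult` fields, the metric key names, and the example limits are assumptions, and cost is averaged over all runs in the suite for simplicity. Percentiles use `statistics.quantiles`, so the sketch assumes at least two runs.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RunResult:
    succeeded: bool
    correct_tool_calls: int
    total_tool_calls: int
    unnecessary_calls: int
    safety_breach: bool
    cost_usd: float
    latency_s: float


def suite_metrics(runs: list[RunResult]) -> dict[str, float]:
    n = len(runs)
    total_calls = sum(r.total_tool_calls for r in runs)
    # 99 cut points; index 49 is the 50th percentile, index 94 the 95th.
    cuts = quantiles([r.latency_s for r in runs], n=100, method="inclusive")
    return {
        "task_success_rate": sum(r.succeeded for r in runs) / n,
        "tool_call_accuracy": sum(r.correct_tool_calls for r in runs) / total_calls,
        "unnecessary_action_rate": sum(r.unnecessary_calls for r in runs) / n,
        "safety_pass_rate": sum(not r.safety_breach for r in runs) / n,
        "cost_per_task_usd": sum(r.cost_usd for r in runs) / n,
        "latency_p50_s": cuts[49],
        "latency_p95_s": cuts[94],
    }


def hard_limit_breaches(metrics, minimums, maximums) -> list[str]:
    """Names of metrics below their floor or above their ceiling."""
    breaches = [k for k, floor in minimums.items() if metrics[k] < floor]
    breaches += [k for k, ceiling in maximums.items() if metrics[k] > ceiling]
    return breaches


# Hypothetical results from four runs.
runs = [
    RunResult(True, 3, 3, 0, False, 0.10, 1.0),
    RunResult(True, 2, 3, 1, False, 0.20, 2.0),
    RunResult(False, 1, 2, 1, False, 0.30, 3.0),
    RunResult(True, 2, 2, 0, False, 0.40, 4.0),
]
metrics = suite_metrics(runs)
breaches = hard_limit_breaches(
    metrics,
    minimums={"task_success_rate": 0.8, "safety_pass_rate": 1.0},
    maximums={"cost_per_task_usd": 0.50, "latency_p95_s": 5.0},
)
```

With these made-up numbers the only breach is the task success rate (0.75 against a floor of 0.8); cost and latency sit inside their ceilings.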

Step 5: Add online / production evals

Offline proves no regression against known cases. Online catches the unknown. Specify:

Sampling: what fraction of production runs to score (start with a fixed daily sample plus 100 percent of runs that hit an error or a forbidden-action guard).
Scoring in production: the same trajectory axes from Step 3, run on sampled real traces. Deterministic checks where possible; LLM-judge for path sanity.
Regression detection: alert when task success rate, safety pass rate, or unnecessary-action rate drifts past its hard limit over a rolling window, not on a single bad run.
Capture for the suite: every production failure is logged with enough context (input, full trace, end state) to become a regression task in Step 2.
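
The sampling rule and the rolling-window alert above can be sketched as follows. This is an illustration under stated assumptions, not the plumbing itself: `should_score`, `RollingRateAlert`, the sample rate, and the window size are hypothetical, and the sample is expressed as a per-run probability rather than a fixed daily count.

```python
import random
from collections import deque


def should_score(run_errored: bool, guard_tripped: bool, sample_rate: float, rng=random) -> bool:
    """Always score error and forbidden-action-guard runs; sample the rest."""
    if run_errored or guard_tripped:
        return True
    return rng.random() < sample_rate


class RollingRateAlert:
    """Alert when a pass rate over the last `window` runs drops below a floor."""

    def __init__(self, window: int, minimum_rate: float):
        self.outcomes = deque(maxlen=window)  # old outcomes fall off automatically
        self.minimum_rate = minimum_rate

    def record(self, passed: bool) -> bool:
        """Record one scored run; return True if the alert should fire."""
        self.outcomes.append(passed)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data yet; never alert on a single bad run
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate < self.minimum_rate


# Hypothetical usage: a window of 4 scored runs with a 0.75 floor.
alert = RollingRateAlert(window=4, minimum_rate=0.75)
fired = [alert.record(outcome) for outcome in [True, True, True, False, False]]
```

In this example the alert stays quiet after the first failure (rate 0.75 over four runs) and fires only once the second failure pulls the window down to 0.5.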

The logging, trace storage, sampling pipeline, and alerting channels are infrastructure. Specify the eval logic here and point at /llm-observability-plan for the plumbing rather than redesigning it.

Step 6: Output the harness design

# Agent Eval Harness: {{agent name}}

**Agent job:** {{one sentence}}
**Per-task success condition style:** {{state check / value match / external assertion}}
**Owner of new regression cases:** {{named person}}

## Offline regression suite
| Category | Count | Notes |
|---|---|---|
| Happy path | | |
| Multi-step / branching | | |
| Adversarial / trap | | |
| Regression (grows) | | |
- Storage / versioning: {{where the suite lives, how it is versioned}}
- Per-task fields captured: input, success condition, loose reference trajectory, forbidden actions, source

## Trajectory scoring
| Axis | Check method | Hard-fail? |
|---|---|---|
| Task success | | |
| Tool-call accuracy | | |
| Path sanity | | |
| Unnecessary actions | | |
| Safety | | yes |
- Judge model (for path-sanity calls): {{model}}
- Loop / no-progress detection: {{mechanism, step cap}}

## Metrics and budgets
| Metric | Target | Hard limit |
|---|---|---|
| Task success rate | | |
| Tool-call accuracy | | |
| Unnecessary-action rate | | |
| Safety pass rate | 100% | 100% |
| Cost per task | | |
| Latency per task (p50 / p95) | | |

## Online evals
- Sampling: {{fraction + always-score conditions}}
- Production scoring: {{which axes, deterministic vs judge}}
- Regression alert: {{metric + window + threshold}}
- Failure capture: {{how a prod failure becomes a regression task}}
- Observability plumbing: see /llm-observability-plan

## Ownership and cadence
- New-failure owner: {{name}}
- Suite run triggers: pre-deploy (full suite), nightly (sample), on incident (add case + re-run)

## Open questions
- {{unresolved decisions}}

Step 7: Review