Agent Reliability Audit

Audit a live AI agent for reliability: termination caps, runaway cost, tool-error recovery, context bloat, and gating on irreversible actions.

Use this when an AI agent is already in production (or staging with real traffic) and you need to find out why it is unreliable, expensive, or unsafe. The work is reading the harness, not the model: termination conditions, step and cost caps, tool-error recovery, context bloat, gates on irreversible actions, and whether anyone instrumented it well enough to even tell. If you are designing a new agent rather than auditing a running one, use /ai-agent-design instead.

Related skills: Pairs with /ai-agent-design for the design patterns this audit checks against. Build trajectory tests with /agent-eval-harness, wire monitoring with /llm-observability-plan, and audit coordinated agents with /multi-agent-orchestration.

The hard part most teams miss

By 2026 agents are in production at scale, and the dominant failure mode is the harness, not the model. Observability adoption is outpacing eval adoption, which means most teams can watch their agent misbehave without being able to prove it improved. Audit the loop, the state, and the gates first.

  1. Most "the agent is broken" reports are harness failures. The model emits tool calls; the harness decides what runs, what is safe, and when to stop. A missing termination condition, no recovery from a tool error, a context window stuffed with stale outputs: none of these are the model's fault, and swapping models will not fix them. Audit the loop before you blame the weights.
  2. Gate the failure you fear most before anything else. The worst outcome is usually runaway cost or an irreversible action taken wrongly (an email sent, data deleted, money moved). These rank highest in any findings list, ahead of accuracy nits. If the catastrophic action is not gated in the harness, that is finding number one regardless of how well the agent performs on a good day.
  3. Reliability is only observable if someone instrumented it. An agent with no run logs, no per-step traces, and no cost attribution cannot be audited, only guessed at. If you cannot see the loop, your first finding is "we are flying blind" and the first fix is instrumentation, because every later claim depends on it.

Process

Step 1: Gather inputs

Ask the user:

  1. What does the agent do, and what does "done" look like? {{agent_job_and_done}} (The job and its checkable success condition.)
  2. What is failing, in their words? {{reported_symptoms}} (Wrong answers, runaway cost, hangs, dangerous actions, silent stalls.)
  3. Can I see the harness? {{harness_access}} (Loop code, tool definitions, prompts, config. Without this you are auditing a black box, and the report should say so.)
  4. Can I see run logs or traces? {{observability_access}} (Per-step traces, token and cost data, error rates. If none exist, that is itself a finding.)
  5. What is the cost of the worst action it can take? {{worst_action}} (Reversible and cheap, or irreversible and expensive. This sets severity.)
  6. What caps and gates exist today? {{existing_controls}} (Max steps, time and cost budgets, confirmations, allowlists, human-in-the-loop.)

If harness access is denied, scope the audit to what the logs reveal and mark every harness-internal finding as inferred.

Step 2: Audit the loop and termination

  • Termination conditions: enumerate every way the loop can end (success met, max steps, budget exhausted, unrecoverable error, hand-back). A loop with no hard cap is a runaway bill waiting to happen.
  • Step cap: confirm a max-step limit exists and fires. An unbounded loop is a finding even if it has never tripped, and it escalates to critical when no cost budget backstops it.
  • No-progress detection: check for repeated identical tool calls or cycles that burn steps without advancing. Detect and break them.
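
As a reference point for what a capped loop looks like, here is a minimal sketch, assuming a hypothetical `next_call` function that stands in for the model and returns either a tool call or `None` when the success condition is met. The names `run_loop`, `LoopResult`, `max_steps`, and `max_repeats` are illustrative, not taken from any particular framework. It only catches consecutive identical calls; longer cycles (A, B, A, B) would need a window of recent signatures.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class LoopResult:
    reason: str  # "success", "max_steps", or "no_progress"
    steps: int   # number of tool calls the loop got through


def run_loop(
    next_call: Callable[[int], Optional[Tuple[str, dict]]],
    max_steps: int = 20,
    max_repeats: int = 3,
) -> LoopResult:
    """Drive a hypothetical agent loop with a hard step cap and a repeat detector."""
    last_signature = None
    repeats = 0
    for step in range(max_steps):
        call = next_call(step)
        if call is None:  # success condition met
            return LoopResult("success", step)
        tool, args = call
        # Canonical form, so argument ordering cannot hide a repeated call.
        signature = tool + ":" + json.dumps(args, sort_keys=True)
        repeats = repeats + 1 if signature == last_signature else 1
        last_signature = signature
        if repeats >= max_repeats:
            return LoopResult("no_progress", step + 1)
        # A real harness would execute the tool here and feed the result back.
    return LoopResult("max_steps", max_steps)
```

When reading a real harness, the question is whether each of these three exits exists and whether each one is reported distinctly, so a log reader can tell a cap hit from a success.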

Step 3: Audit cost and runaway controls

  • Cost budget: is there a per-run token or dollar cap, and does it halt the run? This is the highest-frequency catastrophic finding.
  • Cost attribution: can a single expensive run be traced to its cause (a loop, a huge context, a retry storm)?
  • Concurrency and retry storms: confirm retries have backoff and a ceiling, and that parallel fan-out is bounded.
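
The two controls above can be sketched as follows, assuming a flat placeholder price per thousand tokens and a harness that reports token usage after each model call. `RunBudget`, `BudgetExceeded`, and `call_with_backoff` are hypothetical names for illustration; real pricing and error types will differ.

```python
import time
from typing import Callable, Tuple, Type


class BudgetExceeded(Exception):
    """Raised to halt the run once the per-run budget is spent."""


class RunBudget:
    def __init__(self, max_tokens: int, max_dollars: float, dollars_per_1k_tokens: float):
        self.max_tokens = max_tokens
        self.max_dollars = max_dollars
        self.rate = dollars_per_1k_tokens  # placeholder price, an assumption
        self.tokens = 0

    @property
    def dollars(self) -> float:
        return self.tokens / 1000 * self.rate

    def charge(self, tokens: int) -> None:
        """Record usage and stop the run if either cap is crossed."""
        self.tokens += tokens
        if self.tokens > self.max_tokens or self.dollars > self.max_dollars:
            raise BudgetExceeded(f"{self.tokens} tokens, ${self.dollars:.2f} spent")


def call_with_backoff(
    fn: Callable[[], object],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
):
    """Retry transient failures with exponential backoff and a hard attempt ceiling."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # ceiling reached: surface the error instead of looping
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

The audit check is whether the budget raises (halts the run) rather than only logging, and whether retries count against the same budget.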

Step 4: Audit gates on irreversible actions

  • Inventory irreversible tools: external writes, sends, deletes, payments, anything hard to undo.
  • Confirm the gate is in the harness, not the prompt: the model must not be the gate. Look for confirmations, allowlists, or human-in-the-loop before each irreversible call.
  • Reversibility check: every irreversible action must be gated or made reversible. A gap here is critical by default.
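
A harness-level gate might look something like the sketch below. It assumes a hypothetical tool registry (a dict of callables), an illustrative `IRREVERSIBLE_TOOLS` inventory, and an `approve` callback that could be a human confirmation or an allowlist check. None of these names come from a specific framework.

```python
from typing import Any, Callable, Dict, Set

# Hypothetical inventory of tools whose effects are hard to undo.
IRREVERSIBLE_TOOLS = frozenset({"send_email", "delete_record", "make_payment"})


class GateDenied(Exception):
    """Raised when an irreversible call is blocked before execution."""


def execute_tool(
    name: str,
    args: Dict[str, Any],
    tools: Dict[str, Callable[..., Any]],
    approve: Callable[[str, Dict[str, Any]], bool],
) -> Any:
    """Run a tool, checking the gate in code before any irreversible call."""
    if name not in tools:
        raise KeyError(f"unknown tool: {name}")
    if name in IRREVERSIBLE_TOOLS and not approve(name, args):
        raise GateDenied(f"{name} blocked: no approval")
    return tools[name](**args)


def recipient_allowlist(allowed_domains: Set[str]) -> Callable[[str, Dict[str, Any]], bool]:
    """Build an approver that only permits email to allowlisted domains."""
    def approve(name: str, args: Dict[str, Any]) -> bool:
        to = str(args.get("to", ""))
        return name == "send_email" and to.rsplit("@", 1)[-1] in allowed_domains
    return approve
```

The point of the pattern is that the model's output never reaches the tool without passing through `execute_tool`, so a prompt that fails to discourage the action cannot bypass the check.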

Step 5: Audit tool-error recovery and context

  • Tool errors: confirm a failed tool call returns to the model as a result it can react to, not a crash. Check retry-versus-report per tool.
  • Partial progress: on failure, does the agent hand back what it accomplished plus what remains, or nothing?
  • Context management: check whether the transcript prunes stale tool outputs, summarizes completed sub-tasks, or persists to memory. Unbounded context quietly raises cost and lowers quality.
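
Two of the checks above, sketched under assumptions: tool results are plain dicts, and the transcript is a list of `{"role", "content"}` messages where tool outputs carry the role `"tool"`. Both shapes are illustrative; real harnesses use their own message schema. `safe_tool_call` and `prune_tool_outputs` are hypothetical helper names.

```python
from typing import Any, Callable, Dict, List


def safe_tool_call(fn: Callable[..., Any], **args: Any) -> Dict[str, Any]:
    """Turn a tool failure into a result the model can read, instead of a crash."""
    try:
        return {"ok": True, "output": fn(**args)}
    except Exception as exc:
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


def prune_tool_outputs(
    messages: List[Dict[str, str]],
    keep_last: int = 2,
    placeholder: str = "[stale tool output pruned]",
) -> List[Dict[str, str]]:
    """Replace all but the most recent tool outputs with a short placeholder."""
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = set(tool_idx[:-keep_last]) if keep_last > 0 else set(tool_idx)
    return [
        {**m, "content": placeholder} if i in stale else m
        for i, m in enumerate(messages)
    ]
```

Pruning by recency is the crudest policy; summarizing completed sub-tasks or persisting to memory are the alternatives the audit should look for.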

Step 6: Audit instrumentation and evals

  • Logs and traces: are runs logged with per-step traces, inputs, outputs, tool calls, and errors?
  • Monitoring: are cost, latency, error rate, and step count tracked over time with alerts?
  • Evals: is there any trajectory or outcome eval gating changes, or is "it seemed fine" the only test? Note that observability without evals means regressions ship silently.
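
For a sense of the minimum instrumentation this step looks for, here is a sketch of a per-step trace record emitted as one JSON line, plus a per-run rollup. The field names in `StepTrace` are assumptions chosen for illustration, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class StepTrace:
    run_id: str
    step: int
    tool: str
    tool_input: dict
    tool_output: Optional[str]
    error: Optional[str]
    tokens: int
    latency_s: float


def trace_line(trace: StepTrace) -> str:
    """Serialize one step as a JSON line suitable for an append-only log."""
    return json.dumps(asdict(trace), sort_keys=True)


def summarize(traces: List[StepTrace]) -> Dict[str, float]:
    """Roll a run's steps up into the numbers worth monitoring and alerting on."""
    steps = len(traces)
    errors = sum(1 for t in traces if t.error)
    return {
        "steps": steps,
        "tokens": sum(t.tokens for t in traces),
        "errors": errors,
        "error_rate": errors / steps if steps else 0.0,
    }
```

If a run can be rebuilt from records like these, an expensive or wrong run can be attributed to a specific step; if only the final output is logged, it cannot.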

Step 7: Produce the prioritized findings list

Rank by severity. Critical = runaway cost or an ungated irreversible action. High = no termination cap, no recovery, no instrumentation. Medium = context bloat, weak retries, missing evals. Low = polish. Every finding cites evidence (a file, a log line, an absence) and names a concrete fix.

# Agent Reliability Audit: (agent name)

**Scope:** (harness + logs / logs only / black box)
**Worst action it can take:** (and whether it is gated)

## Findings (highest severity first)

| Finding | Severity | Evidence | Fix |
|---|---|---|---|
| No per-run cost cap; one loop ran to $N | Critical | trace #123, 600 steps, no halt | Add hard token+dollar budget that aborts the run |
| send_email gated only by prompt instruction | Critical | tools.py L40, no confirmation path | Move gate to harness: confirm or allowlist before send |
| No max-step cap on main loop | High | runner.py loop, no counter | Add max-step limit with forced summary on hit |
| Tool errors throw and kill the run | High | logs show crash on 502 | Return error to model as a result; retry-with-backoff |
| No per-step traces | High | only final output logged | Instrument per-step input/output/tool/error before other fixes |
| Context never pruned; window grows unbounded | Medium | trace token count climbs each step | Prune stale tool results, summarize finished sub-tasks |
| No evals gating changes | Medium | no eval suite in repo | Add trajectory eval via /agent-eval-harness |

## Top 3 to fix first
1. (the catastrophic, ungated one)
2. (the runaway-cost one)
3. (the instrumentation gap everything else depends on)

## Open questions
- (anything blocked by lack of access, marked inferred)
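
The severity ranking from Step 7 can be sketched in code, for teams that keep findings as structured data rather than prose. The `Finding` class and helper names are hypothetical; the severity labels and the table columns follow the template above.

```python
from dataclasses import dataclass
from typing import List

# Lower number sorts first, matching the Critical > High > Medium > Low rubric.
SEVERITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}


@dataclass
class Finding:
    title: str
    severity: str
    evidence: str
    fix: str


def rank(findings: List[Finding]) -> List[Finding]:
    """Sort highest severity first; ties keep their original order (stable sort)."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])


def to_table(findings: List[Finding]) -> str:
    """Render ranked findings as the pipe table used in the report template."""
    rows = ["| Finding | Severity | Evidence | Fix |", "|---|---|---|---|"]
    rows += [f"| {f.title} | {f.severity} | {f.evidence} | {f.fix} |" for f in rank(findings)]
    return "\n".join(rows)
```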

Step 8: Review

Ask the user:

  • Is the worst action this agent can take gated in the harness, not just discouraged in the prompt?
  • On a bad day, how does the loop end and what stops the bill?
  • Could you reconstruct a single expensive or wrong run from the logs alone? If not, instrumentation is finding one.
  • Is "it seemed fine" the only test, or is there an eval that would catch a regression?

Anti-patterns

| Anti-pattern | Why it fails | Do instead |
|---|---|---|
| Auditing the model first | Burns time tuning prompts and swapping models when the loop is the problem | Audit termination, state, and gates before touching the model |
| Treating accuracy as the top finding | Ranks a wrong answer above an ungated payment or runaway loop | Rank cost and irreversible-action risk highest, always |
| Trusting prompt-level gates | "Please confirm before deleting" is not a control | Verify the gate lives in the harness code, not the instructions |
| Auditing without logs | Findings are guesses dressed as conclusions | If there are no traces, make instrumentation finding one and mark the rest inferred |
| Calling it safe with no cost cap | One bad loop runs up an unbounded bill | Require a per-run token and dollar budget that halts the run |
| Observability without evals | You can watch regressions ship but never block them | Pair monitoring with a trajectory eval that gates changes |

Output location

Present the prioritized findings list as formatted text in the conversation for the user to copy into their audit doc or ticket tracker.