The phrase "AI agent" gets thrown around loosely. For this framework, an agent is an AI system that can take actions - not just generate text, but read data, call APIs, send messages, modify records. The shift from AI-as-advisor to AI-as-actor changes everything about how you need to design the experience.
Most agent failures look like intelligence problems. The output was wrong, the action was inappropriate, the context was missed. But when you dig in, the root cause is almost always structural. The agent didn't have the right information, didn't know what it was supposed to do, or didn't have the right tools configured.
The waterline model
Think of agent failures like an iceberg. The visible failure - bad output, wrong action, missed context - sits above the waterline. Below it: missing permissions, incomplete context, wrong tool configuration, unclear identity documents, absent feedback loops.
Diagnostic sequence when an agent underperforms:
- Does the agent have access to the information it needs? (Permissions)
- Does the agent know what it's supposed to do? (Identity/soul document)
- Does the agent have the right tools configured? (Integrations)
- Is the context window overloaded or underscoped? (Context management)
- Only after all structural issues are ruled out: is model capability the bottleneck?
Teams waste enormous time prompt-engineering around structural problems. Fix the infrastructure first.
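To make the ordering concrete, here is a minimal Python sketch of the diagnostic sequence. The snapshot keys (`missing_permissions`, `soul_document`, `required_tools` and so on) and the `diagnose` function are hypothetical names invented for illustration, not part of any particular framework. The point is only that the structural checks run first and model capability is the fallback.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    layer: str                      # label from the checklist above
    question: str                   # the question being asked
    passed: Callable[[dict], bool]  # hypothetical predicate over an agent snapshot


# Order matters: structural checks first, model capability only as the fallback.
# All dictionary keys below are assumptions made for this sketch.
CHECKS: List[Check] = [
    Check("Permissions",
          "Does the agent have access to the information it needs?",
          lambda a: not a["missing_permissions"]),
    Check("Identity",
          "Does the agent know what it's supposed to do?",
          lambda a: bool(a["soul_document"].strip())),
    Check("Integrations",
          "Does the agent have the right tools configured?",
          lambda a: set(a["required_tools"]) <= set(a["configured_tools"])),
    Check("Context management",
          "Is the context window overloaded or underscoped?",
          lambda a: a["min_context_tokens"] <= a["context_tokens"] <= a["max_context_tokens"]),
]


def diagnose(agent: dict) -> str:
    """Return the first failing structural layer, else 'Model capability'."""
    for check in CHECKS:
        if not check.passed(agent):
            return check.layer
    return "Model capability"
```

Calling `diagnose` on a snapshot with an empty `configured_tools` list would report "Integrations" before anyone reaches for the prompt.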
Three layers of agent design
An effective agent system has three layers. Most products only build one.
Soul
The identity document. Who the agent is, what it values, how it communicates, what it refuses to do. This persists across sessions and provides behavioral consistency.
A good soul document covers: voice and tone, domain expertise, decision-making principles, escalation boundaries, and explicit limitations. Without it, the agent's personality changes with every conversation.
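One way to keep a soul document honest is to give it a fixed shape. The sketch below is an assumption about how that might look in Python: the `SoulDocument` class, its field names, and its two helper methods are hypothetical, chosen to mirror the five areas listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SoulDocument:
    """Hypothetical structure mirroring the five areas a soul document covers."""
    voice_and_tone: str
    domain_expertise: List[str]
    decision_principles: List[str]
    escalation_boundaries: List[str]
    limitations: List[str]

    def missing_sections(self) -> List[str]:
        # Any empty string or empty list counts as a missing section.
        return [name for name, value in vars(self).items() if not value]

    def to_system_prompt(self) -> str:
        # Render the document as plain text that can persist across sessions.
        sections = [
            ("Domain expertise", self.domain_expertise),
            ("Decision-making principles", self.decision_principles),
            ("Escalation boundaries", self.escalation_boundaries),
            ("Limitations", self.limitations),
        ]
        lines = [f"Voice and tone: {self.voice_and_tone}"]
        lines += [f"{title}: " + "; ".join(items) for title, items in sections]
        return "\n".join(lines)
```

A check such as `missing_sections()` can run before deployment, so a vague or half-written identity is caught as a structural problem rather than discovered as inconsistent behavior.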
Heartbeat
The recurring cadence. Scheduled tasks that run without prompting: morning briefings, end-of-day summaries, weekly pulse checks, inbox monitoring. This turns a passive tool into an active collaborator.
Heartbeat is what separates "I have a chatbot" from "I have a system that works for me." Most products skip this layer entirely and wonder why adoption stalls after the novelty wears off.
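A heartbeat can start as nothing more than a small schedule table that a scheduler polls once a minute. The following is a minimal sketch under that assumption. `HeartbeatTask`, `HEARTBEAT`, `due_tasks`, and the specific times are hypothetical. Interval-based work such as inbox monitoring is left out to keep the example short.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List, Optional


@dataclass(frozen=True)
class HeartbeatTask:
    name: str
    at: time                       # local time of day to run
    weekday: Optional[int] = None  # 0 = Monday ... 6 = Sunday; None = every day


# Example cadence; the times are arbitrary placeholders.
HEARTBEAT: List[HeartbeatTask] = [
    HeartbeatTask("morning_briefing", time(8, 0)),
    HeartbeatTask("end_of_day_summary", time(17, 30)),
    HeartbeatTask("weekly_pulse_check", time(9, 0), weekday=0),
]


def due_tasks(now: datetime, tasks: List[HeartbeatTask] = HEARTBEAT) -> List[str]:
    """Names of tasks scheduled for the same weekday and minute as `now`."""
    return [
        t.name
        for t in tasks
        if (t.weekday is None or t.weekday == now.weekday())
        and (t.at.hour, t.at.minute) == (now.hour, now.minute)
    ]
```

The design choice worth noting is that nothing here waits for a user prompt: whatever calls `due_tasks` on a timer is what makes the agent show up on its own.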
Jobs
The task backlog. Specific work the agent can do on request: generate a report, draft a response, analyze a dataset, triage incoming items. In most AI products, this is the entire offering.
Jobs are necessary but insufficient. Without soul, the jobs lack consistency. Without heartbeat, the user has to remember to use the agent. The full stack - soul, heartbeat, jobs - creates a system that knows who it is, shows up proactively, and does useful work.
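The jobs layer is the easiest of the three to picture in code: a named set of tasks the agent can run on request. Here is a rough sketch, assuming a simple in-process registry. The `job` decorator, `run_job`, and both example jobs are hypothetical, and the job bodies are placeholders standing in for real model or tool calls.

```python
from typing import Callable, Dict

# Registry of on-request jobs, keyed by name (hypothetical structure).
JOBS: Dict[str, Callable[[str], str]] = {}


def job(name: str):
    """Decorator that registers a function as a named job."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        JOBS[name] = fn
        return fn
    return register


@job("draft_response")
def draft_response(request: str) -> str:
    # Placeholder: a real job would call a model here.
    return f"DRAFT: reply to '{request}'"


@job("triage")
def triage(request: str) -> str:
    # Placeholder rule standing in for real triage logic.
    return "urgent" if "outage" in request.lower() else "normal"


def run_job(name: str, request: str) -> str:
    """Run a registered job; unknown names fail loudly instead of being guessed."""
    if name not in JOBS:
        raise KeyError(f"unknown job: {name}")
    return JOBS[name](request)
```

On its own this registry is exactly the "entire offering" described above: it does useful work, but only when someone remembers to call it.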
Progressive trust
Grant agent capabilities incrementally, not all at once:
| Trust level | Capabilities | Example |
|---|---|---|
| Read-only | View data, analyze, summarize | Calendar read, email read, file analysis |
| Draft | Generate content for human review | Email drafts, document drafts, code suggestions |
| Send with approval | Take actions, but human confirms each one | Send email after review, create ticket after confirmation |
| Autonomous | Act independently within defined boundaries | Auto-triage low-priority items, schedule recurring tasks |
Each level requires demonstrated reliability at the previous level. Jumping straight to autonomous is how you get agents sending embarrassing emails or creating duplicate records.
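The trust ladder can be enforced in code rather than left as policy. The sketch below is one possible shape, not a prescribed implementation: the action names, the `AUTONOMOUS_BOUNDARY` set, and the promotion thresholds (50 reviewed actions, 2% error rate) are all assumptions picked for illustration.

```python
from enum import IntEnum


class TrustLevel(IntEnum):
    READ_ONLY = 1
    DRAFT = 2
    SEND_WITH_APPROVAL = 3
    AUTONOMOUS = 4


# Hypothetical action catalogue. Actions without side effects only need a minimum level.
MIN_LEVEL = {"read_calendar": TrustLevel.READ_ONLY, "draft_email": TrustLevel.DRAFT}
SIDE_EFFECT_ACTIONS = {"send_email", "create_ticket", "auto_triage"}
AUTONOMOUS_BOUNDARY = {"auto_triage"}  # what an autonomous agent may do unprompted


def authorize(action: str, level: TrustLevel, approved: bool = False) -> bool:
    """Decide whether the agent may perform `action` at its current trust level."""
    if action not in SIDE_EFFECT_ACTIONS:
        return level >= MIN_LEVEL[action]
    if level >= TrustLevel.AUTONOMOUS and action in AUTONOMOUS_BOUNDARY:
        return True
    if level >= TrustLevel.SEND_WITH_APPROVAL:
        return approved  # a human must confirm each action
    return False


def promote(level: TrustLevel, reviewed: int, errors: int,
            min_reviewed: int = 50, max_error_rate: float = 0.02) -> TrustLevel:
    """Move up one level only after demonstrated reliability at the current one."""
    if level == TrustLevel.AUTONOMOUS or reviewed < min_reviewed:
        return level
    if errors / reviewed > max_error_rate:
        return level
    return TrustLevel(level + 1)
```

Note that `promote` only ever moves one step at a time, and that even an autonomous agent still needs approval for side-effect actions outside its defined boundary.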
This trust model maps to security thinking: each trust level requires a different security posture. The UX for each level is covered in agentic UX.
Measuring agent health
Use the AI Health Indicator to assess your agent across all six CARATS dimensions. Agents are particularly vulnerable to:
- Consistency failures when context management is poor
- Alignment failures when the soul document is missing or vague
- Security failures when trust levels aren't enforced
- Tone failures when the agent operates across different contexts without adjusting
Track these dimensions over time. Agent quality degrades silently - a model update, a context change, or a new integration can each shift behavior without anyone noticing.
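Silent degradation is easier to catch when scores are compared against their own history. Below is a minimal sketch, assuming each dimension is scored between 0 and 1 at regular intervals. The `degraded_dimensions` function, the window size, and the 0.1 drop threshold are hypothetical, and the scoring itself (however the AI Health Indicator produces it) is outside the example.

```python
from statistics import mean
from typing import Dict, List


def degraded_dimensions(history: Dict[str, List[float]],
                        window: int = 3,
                        max_drop: float = 0.1) -> List[str]:
    """Flag dimensions whose recent average fell well below their earlier baseline.

    `history` maps a dimension name to its scores, oldest first.
    Dimensions with fewer than 2 * window scores are skipped as too short to judge.
    """
    flagged = []
    for dimension, scores in history.items():
        if len(scores) < 2 * window:
            continue
        baseline = mean(scores[:-window])  # everything before the recent window
        recent = mean(scores[-window:])    # the most recent `window` scores
        if baseline - recent > max_drop:
            flagged.append(dimension)
    return flagged
```

Running a check like this after every model update or new integration turns "nobody noticed" into an alert with a named dimension attached.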
Skills for this topic
AI skills you can run with Claude or Codex to put this practice to work.
- /ai-agent-design (AI Agent Design): Design a multi-step AI agent: the tier decision, tool surface, loop and termination, state, context management, and failure recovery.
- /ai-guardrails-design (AI Guardrails Design): Design layered defenses for an AI feature: input validation, output filtering, jailbreak and abuse detection, calibrated against false-block cost.
- /ai-health-check (AI Health Check): Run a structural health check of an AI system.
- /llm-observability-plan (LLM Observability Plan): Plan monitoring for LLM-powered features in production - quality metrics, cost tracking, drift detection, and alerting.
- /ai-product-spec (AI Product Spec): Spec an AI-powered feature covering model requirements, prompt architecture, quality bar, cost projections, and guardrails.
- /multi-model-strategy (Multi-Model Strategy): Choose the right AI model for each job in a product - model mapping, routing, cost modeling, and migration planning.
Apps for this topic
Real, free tools on this site that do this work for you right now.
- Already using AI? This checks whether you are using it well. Measures context discipline, evaluation maturity, and experimentation rigor.
- Shipping an AI feature without evals is shipping on vibes. Learn what evals are, then build one: quality rubric, starter golden dataset, eval pipeline, and a complete plan you can download.