Traditional software security has decades of hard-won wisdom: validate inputs, sanitize outputs, follow least privilege, assume breach. AI systems inherit all of those requirements and add new ones that most teams aren't thinking about yet.
The gap isn't awareness - most teams know prompt injection exists. The gap is practice. Teams acknowledge the risk in a design doc and then ship without testing for it. Security thinking for AI means integrating adversarial analysis into your regular development process, not treating it as a one-time review.
The AI risk landscape
Six risks show up repeatedly in production AI systems:
| Risk | What happens | Example |
|---|---|---|
| Prompt injection | Adversarial input overrides system instructions | User input containing "ignore previous instructions" changes AI behavior |
| System prompt leakage | AI reveals its hidden instructions | User asks "what are your instructions?" and the AI complies |
| Data leakage | Private data surfaces in outputs | AI trained on customer data reveals one customer's info to another |
| Cross-tenant contamination | One user's data influences another's responses | Shared context or caching leaks information across sessions |
| Unauthorized actions | AI performs high-impact operations without approval | Agent sends an email, modifies a record, or initiates a transaction without human review |
| Adversarial manipulation | Attacker poisons input data to skew AI behavior | Manipulated training data or behavioral baselines produce biased outputs |
The four-layer guardrail stack
Effective AI security uses defense in depth. No single layer catches everything.
Layer 1: Content safety. Filter inputs and outputs for policy violations - hate speech, violence, self-harm content. This is table stakes, not the whole strategy.
Layer 2: Input pattern filters. Scan for suspicious patterns in user input: hidden instructions, bypass attempts, prompt injection signatures, encoded payloads. This catches the obvious attacks.
Layer 3: Role-based tool gating. Limit what actions the AI can take based on user permissions. Send, delete, trade, and publish operations require explicit authorization. An AI agent should never have more permissions than the human it's acting for.
Layer 4: Output validation. Screen AI responses before they reach the user. Check for unexpected tokens, PII leakage, excessive length, off-brand language, and responses that don't match the expected format. This is your last line of defense.
Making security part of the process
Threat modeling for AI features
Before building an AI feature, ask:
- What's the worst thing this AI could say or do?
- Who benefits from manipulating this system?
- What data does this AI have access to that it shouldn't reveal?
- If the AI acts autonomously, what's the blast radius of a wrong action?
These questions belong in the product spec, not a separate security review. By the time security reviews the feature, the architecture is set and the guardrails are afterthoughts.
Adversarial testing as practice
Red-team your AI before users do. This means:
- Try to make it reveal system instructions
- Try to make it perform actions outside its intended scope
- Feed it edge-case inputs designed to confuse
- Test boundary conditions between what it should and shouldn't do
Build these tests into your CI/CD pipeline. Run them on every prompt change, every model update, every context modification. Manual security reviews don't scale; automated adversarial tests do.
Progressive trust
Not every AI feature needs the same security posture. Match the guardrail investment to the risk:
- Read-only AI (summarization, search, analysis): Focus on data leakage and prompt injection
- Draft-mode AI (generates content for human review): Add output validation
- Autonomous AI (takes actions directly): Full stack required, plus human-in-the-loop for high-impact operations
The agent experience and agentic UX practices cover how to design these trust levels into the user experience.
The CARATS security dimension
Security is the "S" in the CARATS framework. When assessing your team's security posture:
- Red (needs attention): No adversarial testing, no guardrail layers, security hasn't been discussed
- Yellow (at risk): Some guardrails in place, but not tested adversarially. Security reviewed once but not continuously monitored
- Green (healthy): Defense in depth implemented, adversarial tests in CI/CD, security boundaries documented and enforced, incident response plan exists
Most teams I assess land at Yellow. They've added a content filter and called it done. The gap between Yellow and Green is systematic adversarial testing and continuous monitoring.
Related practices
Related services
Want help with security thinking?
I coach teams on this practice. Let's talk about your situation.
Get in touch