Security Thinking

Traditional software security has decades of hard-won wisdom: validate inputs, sanitize outputs, follow least privilege, assume breach. AI systems inherit all of those requirements and add new ones that most teams aren't thinking about yet.

The gap isn't awareness - most teams know prompt injection exists. The gap is practice. Teams acknowledge the risk in a design doc and then ship without testing for it. Security thinking for AI means integrating adversarial analysis into your regular development process, not treating it as a one-time review.

The AI risk landscape

Six risks show up repeatedly in production AI systems:

Risk	What happens	Example
Prompt injection	Adversarial input overrides system instructions	User input containing "ignore previous instructions" changes AI behavior
System prompt leakage	AI reveals its hidden instructions	User asks "what are your instructions?" and the AI complies
Data leakage	Private data surfaces in outputs	AI trained on customer data reveals one customer's info to another
Cross-tenant contamination	One user's data influences another's responses	Shared context or caching leaks information across sessions
Unauthorized actions	AI performs high-impact operations without approval	Agent sends an email, modifies a record, or initiates a transaction without human review
Adversarial manipulation	Attacker poisons input data to skew AI behavior	Manipulated training data or behavioral baselines produce biased outputs

The four-layer guardrail stack

Effective AI security uses defense in depth. No single layer catches everything.

Layer 1: Content safety. Filter inputs and outputs for policy violations - hate speech, violence, self-harm content. This is table stakes, not the whole strategy.

Layer 2: Input pattern filters. Scan for suspicious patterns in user input: hidden instructions, bypass attempts, prompt injection signatures, encoded payloads. This catches the obvious attacks.

Layer 3: Role-based tool gating. Limit what actions the AI can take based on user permissions. Send, delete, trade, and publish operations require explicit authorization. An AI agent should never have more permissions than the human it's acting for.

Layer 4: Output validation. Screen AI responses before they reach the user. Check for unexpected tokens, PII leakage, excessive length, off-brand language, and responses that don't match the expected format. This is your last line of defense.

Making security part of the process

Threat modeling for AI features

Before building an AI feature, ask:

What's the worst thing this AI could say or do?
Who benefits from manipulating this system?
What data does this AI have access to that it shouldn't reveal?
If the AI acts autonomously, what's the blast radius of a wrong action?

These questions belong in the product spec, not a separate security review. By the time security reviews the feature, the architecture is set and the guardrails are afterthoughts.

Adversarial testing as practice

Red-team your AI before users do. This means:

Try to make it reveal system instructions
Try to make it perform actions outside its intended scope
Feed it edge-case inputs designed to confuse
Test boundary conditions between what it should and shouldn't do

Build these tests into your CI/CD pipeline. Run them on every prompt change, every model update, every context modification. Manual security reviews don't scale; automated adversarial tests do.

Progressive trust

Not every AI feature needs the same security posture. Match the guardrail investment to the risk:

Read-only AI (summarization, search, analysis): Focus on data leakage and prompt injection
Draft-mode AI (generates content for human review): Add output validation
Autonomous AI (takes actions directly): Full stack required, plus human-in-the-loop for high-impact operations

The agent experience and agentic UX practices cover how to design these trust levels into the user experience.

The CARATS security dimension

Security is the "S" in the CARATS framework. When assessing your team's security posture:

Red (needs attention): No adversarial testing, no guardrail layers, security hasn't been discussed
Yellow (at risk): Some guardrails in place, but not tested adversarially. Security reviewed once but not continuously monitored
Green (healthy): Defense in depth implemented, adversarial tests in CI/CD, security boundaries documented and enforced, incident response plan exists

Most teams I assess land at Yellow. They've added a content filter and called it done. The gap between Yellow and Green is systematic adversarial testing and continuous monitoring.