Skip to main content
Ai StrategyIntermediate7 min read

Security Thinking

How to think about AI security as a product concern, not just an engineering checklist

Traditional software security has decades of hard-won wisdom: validate inputs, sanitize outputs, follow least privilege, assume breach. AI systems inherit all of those requirements and add new ones that most teams aren't thinking about yet.

The gap isn't awareness - most teams know prompt injection exists. The gap is practice. Teams acknowledge the risk in a design doc and then ship without testing for it. Security thinking for AI means integrating adversarial analysis into your regular development process, not treating it as a one-time review.

The AI risk landscape

Six risks show up repeatedly in production AI systems:

RiskWhat happensExample
Prompt injectionAdversarial input overrides system instructionsUser input containing "ignore previous instructions" changes AI behavior
System prompt leakageAI reveals its hidden instructionsUser asks "what are your instructions?" and the AI complies
Data leakagePrivate data surfaces in outputsAI trained on customer data reveals one customer's info to another
Cross-tenant contaminationOne user's data influences another's responsesShared context or caching leaks information across sessions
Unauthorized actionsAI performs high-impact operations without approvalAgent sends an email, modifies a record, or initiates a transaction without human review
Adversarial manipulationAttacker poisons input data to skew AI behaviorManipulated training data or behavioral baselines produce biased outputs

The four-layer guardrail stack

Effective AI security uses defense in depth. No single layer catches everything.

Layer 1: Content safety. Filter inputs and outputs for policy violations - hate speech, violence, self-harm content. This is table stakes, not the whole strategy.

Layer 2: Input pattern filters. Scan for suspicious patterns in user input: hidden instructions, bypass attempts, prompt injection signatures, encoded payloads. This catches the obvious attacks.

Layer 3: Role-based tool gating. Limit what actions the AI can take based on user permissions. Send, delete, trade, and publish operations require explicit authorization. An AI agent should never have more permissions than the human it's acting for.

Layer 4: Output validation. Screen AI responses before they reach the user. Check for unexpected tokens, PII leakage, excessive length, off-brand language, and responses that don't match the expected format. This is your last line of defense.

Making security part of the process

Threat modeling for AI features

Before building an AI feature, ask:

  • What's the worst thing this AI could say or do?
  • Who benefits from manipulating this system?
  • What data does this AI have access to that it shouldn't reveal?
  • If the AI acts autonomously, what's the blast radius of a wrong action?

These questions belong in the product spec, not a separate security review. By the time security reviews the feature, the architecture is set and the guardrails are afterthoughts.

Adversarial testing as practice

Red-team your AI before users do. This means:

  • Try to make it reveal system instructions
  • Try to make it perform actions outside its intended scope
  • Feed it edge-case inputs designed to confuse
  • Test boundary conditions between what it should and shouldn't do

Build these tests into your CI/CD pipeline. Run them on every prompt change, every model update, every context modification. Manual security reviews don't scale; automated adversarial tests do.

Progressive trust

Not every AI feature needs the same security posture. Match the guardrail investment to the risk:

  • Read-only AI (summarization, search, analysis): Focus on data leakage and prompt injection
  • Draft-mode AI (generates content for human review): Add output validation
  • Autonomous AI (takes actions directly): Full stack required, plus human-in-the-loop for high-impact operations

The agent experience and agentic UX practices cover how to design these trust levels into the user experience.

The CARATS security dimension

Security is the "S" in the CARATS framework. When assessing your team's security posture:

  • Red (needs attention): No adversarial testing, no guardrail layers, security hasn't been discussed
  • Yellow (at risk): Some guardrails in place, but not tested adversarially. Security reviewed once but not continuously monitored
  • Green (healthy): Defense in depth implemented, adversarial tests in CI/CD, security boundaries documented and enforced, incident response plan exists

Most teams I assess land at Yellow. They've added a content filter and called it done. The gap between Yellow and Green is systematic adversarial testing and continuous monitoring.

Want help with security thinking?

I coach teams on this practice. Let's talk about your situation.

Get in touch