Core Delivery | Intermediate | 8 min read

Test-Driven Development

Writing tests first isn't just for engineers -- it's how product teams verify that what gets built is what was intended

TDD is a development discipline where you write a failing test before writing any production code. The cycle is short and repeating:

  1. Red -- write a test that describes what the code should do. Run it. Watch it fail.
  2. Green -- write the minimum code to make the test pass. Nothing more.
  3. Refactor -- clean up the code while keeping all tests green. Remove duplication, improve names, simplify structure.

Then repeat. Every feature grows one passing test at a time.
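As a minimal sketch of one pass through the cycle, assume a hypothetical `shipping_cost` function and an invented rule (orders of 50.00 or more ship free, everything else pays a flat 4.99). Neither the function nor the rule comes from a real codebase; they only illustrate the order of work. The tests are plain functions with `assert`, which a runner such as pytest would collect, but nothing here depends on pytest.

```python
# Step 1 (Red): the tests are written first. Until shipping_cost exists,
# calling either test fails with a NameError.
def test_orders_of_50_or_more_ship_free():
    assert shipping_cost(order_total=50.00) == 0.00


def test_orders_under_50_pay_flat_rate():
    assert shipping_cost(order_total=49.99) == 4.99


# Step 2 (Green): the minimum code that makes both tests pass.
# Step 3 (Refactor): the magic numbers were pulled into named constants
# while the tests stayed green.
FREE_SHIPPING_THRESHOLD = 50.00  # hypothetical business rule
FLAT_RATE = 4.99                 # hypothetical business rule


def shipping_cost(order_total: float) -> float:
    """Return the shipping charge for an order total."""
    return 0.00 if order_total >= FREE_SHIPPING_THRESHOLD else FLAT_RATE
```

The next acceptance criterion would start the loop again with a new failing test.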

Why TDD matters more with AI

When AI agents write code, TDD becomes the control mechanism -- the way you verify that generated code actually does what you intended. (Where your team sits on the AI maturity spectrum determines how much of this cycle the agent handles.)

Without TDD, you're reviewing AI-generated code by reading it. With TDD, you're verifying it by running it. A test run is faster and more repeatable than a read-through, and it checks every behavior the tests describe on every change, which human code review alone cannot do.

The AI-augmented version of TDD works like this:

  1. Human writes acceptance criteria -- in a testable format (Given/When/Then works well)
  2. Agent generates failing tests from the acceptance criteria
  3. Human reviews the tests -- are they testing the right things? Are edge cases covered?
  4. Agent writes implementation code to make the tests pass
  5. Human reviews the code -- does it match expectations? Any security or performance issues?
  6. Both refactor -- agent suggests cleanup, human approves

The key shift: the developer's primary job moves from writing code to reviewing and validating it.
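To make steps 1 to 4 concrete, here is a small sketch of a Given/When/Then criterion turned into tests, followed by an implementation that satisfies them. The coupon code `SAVE10`, the `apply_coupon` function, and the cents-based totals are all assumptions made up for this example.

```python
# Hypothetical acceptance criterion written by a human:
#   Given a cart totalling $100.00
#   When the shopper applies the coupon SAVE10
#   Then the total is $90.00

COUPONS = {"SAVE10": 10}  # hypothetical coupon code -> percent off


def apply_coupon(total_cents: int, code: str) -> int:
    """Return the total in cents after applying a coupon, if it is known."""
    percent = COUPONS.get(code, 0)
    return total_cents - total_cents * percent // 100


def test_valid_coupon_reduces_total():
    # Given a cart totalling $100.00
    total_cents = 10_000
    # When the shopper applies the coupon SAVE10
    result = apply_coupon(total_cents, "SAVE10")
    # Then the total is $90.00
    assert result == 9_000


def test_unknown_coupon_leaves_total_unchanged():
    # An edge case a human reviewer might ask for in step 3
    assert apply_coupon(10_000, "BOGUS") == 10_000
```

Keeping the Given/When/Then wording as comments makes it easy for a reviewer to check each test against the criterion it came from.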

How to do it

Starting a new feature

  1. Pick the first acceptance criterion from the story
  2. Write one failing test that captures that criterion
  3. Run the test -- confirm it fails (Red)
  4. Write the simplest code to make it pass (Green)
  5. Refactor if needed -- then move to the next criterion

Fixing a bug

  1. Write a failing test that reproduces the bug
  2. Confirm the test fails (this proves you've captured the bug)
  3. Fix the bug -- the test should now pass
  4. Check that no other tests broke
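The steps above can be sketched with an invented off-by-one bug: a hypothetical `page_count` helper that used floor division and dropped the final, partly filled page. The function name and the bug are illustrative assumptions only.

```python
def page_count(total_items: int, page_size: int) -> int:
    """Return how many pages are needed to show total_items."""
    # The buggy version was: return total_items // page_size
    # (21 items with 10 per page gave 2 pages instead of 3).
    return -(-total_items // page_size)  # ceiling division


def test_partial_last_page_is_counted():
    # Written first: this test failed against the buggy version,
    # which proved the bug was captured.
    assert page_count(21, 10) == 3


def test_exact_multiple_still_works():
    # An existing test that must stay green after the fix
    assert page_count(20, 10) == 2
```

The first test stays in the suite as a regression test for this bug.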

Working with legacy code

When modifying code that has no tests, write pinning tests first -- tests that capture the current behavior, whether that behavior is correct or not. These protect you from unintended changes while you refactor or extend the code.

Pinning tests are especially important when AI agents modify legacy code. They prevent the agent from silently changing existing behavior.
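A pinning test might look like the sketch below. `legacy_format_name` is a made-up stand-in for untested legacy code, and its whitespace quirk is invented for the example; the point is that the tests record what the code does today, not what it should do.

```python
def legacy_format_name(first, last):
    # Hypothetical legacy function, left exactly as found
    return last.upper() + "," + first


def test_pins_current_output_format():
    assert legacy_format_name("Ada", "Lovelace") == "LOVELACE,Ada"


def test_pins_whitespace_quirk():
    # Possibly a bug: surrounding whitespace is not stripped.
    # It is pinned on purpose so that any change to it is a visible decision.
    assert legacy_format_name(" Ada", "Lovelace ") == "LOVELACE , Ada"
```

If a later change, human or agent, starts trimming whitespace, the second test fails and the team can decide whether the new behavior is wanted.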

Testing AI-powered features

AI features break a core TDD assumption: determinism. The same prompt can produce different outputs. This doesn't mean TDD doesn't apply -- it means you test differently.

What you can test deterministically

Even with non-deterministic AI outputs, most of the code is fully testable with traditional TDD:

  • Input validation and preprocessing
  • Output parsing and formatting
  • Error handling (timeouts, rate limits, malformed responses)
  • Routing logic (which prompt or model gets called)
  • Guardrail enforcement (blocking outputs that violate policies)
  • Context assembly (RAG results, user profile, conversation history)

As a rough rule of thumb, this is 60-80% of the code in an AI feature. Write standard TDD tests for all of it.
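For example, output parsing and malformed-response handling can be tested without calling a model at all. The sketch below assumes a hypothetical response format (a JSON object with a `category` field) and an invented `MalformedResponse` error; a real feature would use its own schema.

```python
import json


class MalformedResponse(ValueError):
    """Raised when the model's raw text cannot be used (hypothetical name)."""


def parse_recommendation(raw: str) -> dict:
    """Parse raw model text into a normalized recommendation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise MalformedResponse("not valid JSON") from exc
    if not isinstance(data, dict) or "category" not in data:
        raise MalformedResponse("missing 'category' field")
    return {"category": str(data["category"]).strip().lower()}


def test_parses_well_formed_response():
    assert parse_recommendation('{"category": " Books "}') == {"category": "books"}


def test_rejects_non_json():
    try:
        parse_recommendation("Sure! Here is your answer...")
    except MalformedResponse:
        return
    raise AssertionError("expected MalformedResponse")
```

Because the model's text is passed in as a plain string, these tests are fully deterministic.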

What to test differently

For the AI output itself, shift from exact-match assertions to structural and property-based assertions:

  • Format properties: "Output is valid JSON," "Output has exactly 3 sections"
  • Content boundaries: "Output never mentions competitors," "Output never includes PII patterns"
  • Consistency properties: "Given the same user profile, the recommendation category is stable across runs"
  • Length constraints: "Output is between 50 and 200 tokens"
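One way to express such properties is a checker that returns a list of violations instead of comparing against an exact string. This is a sketch under several assumptions: the output is expected to be a JSON object with a `sections` list, an email-like regex stands in for "PII patterns", and whitespace-separated words are a crude stand-in for tokens.

```python
import json
import re

# Deliberately simple pattern; a real PII check would cover more cases.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def violations(output: str, min_words: int = 5, max_words: int = 200) -> list:
    """Return the properties that a model output violates (empty if none)."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["not a JSON object"]

    problems = []
    sections = data.get("sections")
    if not isinstance(sections, list) or len(sections) != 3:
        problems.append("expected exactly 3 sections")
    if EMAIL_PATTERN.search(output):
        problems.append("contains an email-like pattern")
    word_count = len(output.split())  # words used as a rough proxy for tokens
    if not min_words <= word_count <= max_words:
        problems.append("length out of bounds")
    return problems


def test_well_formed_output_has_no_violations():
    good = json.dumps(
        {"sections": ["Intro text here", "Main body text", "Closing remarks now"]}
    )
    assert violations(good) == []
```

The same checker can be run against every sampled output, whatever its exact wording.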

Statistical testing

When output quality varies, use statistical assertions:

  • Run the prompt N times and check that at least a set fraction of runs (for example, 4 out of 5) pass the quality check
  • Use rubric scoring on a test dataset -- check that average score meets the threshold
  • Run the same input multiple times and check that outputs are semantically similar
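The first approach can be sketched as a pass-rate assertion. Here `make_fake_model` is a hypothetical stand-in that replays canned outputs so the example runs offline; in a real test, `generate` would call the model, and the quality check would be more than a non-empty test.

```python
from itertools import cycle


def pass_rate(generate, passes_check, runs: int) -> float:
    """Call generate() `runs` times and return the fraction that pass."""
    results = [passes_check(generate()) for _ in range(runs)]
    return sum(results) / runs


def make_fake_model(outputs):
    """Hypothetical stand-in for a model call: replays canned outputs in order."""
    replay = cycle(outputs)
    return lambda: next(replay)


def test_at_least_80_percent_of_runs_pass():
    # Four usable outputs and one empty one
    fake = make_fake_model(["ok summary"] * 4 + [""])
    rate = pass_rate(fake, passes_check=lambda out: len(out) > 0, runs=5)
    assert rate >= 0.8
```

The threshold and the number of runs are choices to tune: more runs give a steadier estimate but cost more time and model calls.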

When TDD is the wrong paradigm

TDD works poorly for creative generation with no single "right" answer, open-ended conversation where quality is subjective, and exploratory features where desired behavior is still being discovered. Even in these cases, TDD applies to the infrastructure around the AI.

What each role prepares

Role | Preparation
Engineer | Know the test framework. Have the test runner configured. Know how to run a single test.
PM | Write acceptance criteria in a testable format -- these become the test specifications.
Designer | Clarify interaction details -- what should the user see, click, or experience? These inform what the tests verify.

Common pitfalls

  • Writing tests after the code -- that's "test-after development," not TDD. The test must fail first. (If quality is slipping, a delivery diagnostic can reveal whether TDD is being skipped under pressure.)
  • Testing implementation details -- test behavior, not internals. If you refactor and the tests break, they were testing the wrong thing
  • Large tests -- each test should verify one thing. If a test name includes "and," split it
  • Skipping the refactor step -- red-green without refactor leads to working but messy code
  • Trusting a single test run with AI code -- run the tests multiple times, and again whenever the agent regenerates the code. Code generation is non-deterministic, so one passing run isn't enough

Try this today

Next time you fix a bug, write the failing test first. Before you touch the implementation code, prove the bug exists with a test that fails. Then fix it. The test should pass. You just did TDD -- and you have a regression test that protects that fix forever.
