TDD with AI agents: Red, Green, Refactor still works
I wrote about multi-agent coordination in Copilot and briefly showed a TDD orchestration pattern with Red, Green, and Refactor agents. That section was a teaser. I wanted to go deeper.
I am a big fan and practitioner of XP and TDD, and lately I have been experimenting with TDD with AI agents on real tasks. Not toy examples. Real features with edge cases, real bugs that surfaced during testing, and real frustration when the agent decided my test was wrong instead of the code.
The result surprised me. TDD is not just compatible with AI agents, it is one of the best ways to keep them honest.
Why TDD matters more with AI
When you write code yourself, you have a mental model of what the code does. You can trace logic in your head. You catch mistakes because you understand the intent behind every line.
AI agents do not have that. They generate code that looks right. It compiles. It might even pass a few tests. But they do not understand intent the way you do. They pattern-match against training data and produce plausible output.
This is exactly the problem TDD was designed to solve. Not for AI, originally, but for humans who make mistakes. The tests are a specification. They define what “correct” means before any implementation exists. When the implementation comes from an AI agent, having that specification in place is even more important.
Paul Sobocinski from Thoughtworks wrote about this back in 2023. His team found that TDD provides two things AI desperately needs: fast feedback and divide-and-conquer problem solving. I agree with both, but I want to add a third: TDD creates a contract that the agent cannot renegotiate.
When you tell an agent “implement this feature,” it decides what “done” looks like. When you hand it a failing test, the test decides what “done” looks like. That difference changes everything.
The classic cycle, adapted for agents
TDD has three steps. Red: write a failing test. Green: write the minimum code to pass. Refactor: clean up without breaking anything.
In traditional development, one person does all three. With agents, each step benefits from a different mindset, different tools, and sometimes even a different model.
```mermaid
flowchart LR
    subgraph red["Red Phase"]
        direction TB
        R1["Read existing code"] --> R2["Write failing test"]
        R2 --> R3["Verify test fails"]
    end
    subgraph green["Green Phase"]
        direction TB
        G1["Read failing test"] --> G2["Write minimal code"]
        G2 --> G3["Run tests - all pass?"]
        G3 -->|no| G2
    end
    subgraph refactor["Refactor Phase"]
        direction TB
        RF1["Analyze code smells"] --> RF2["Improve structure"]
        RF2 --> RF3["Run tests - still pass?"]
        RF3 -->|no| RF2
    end
    red -->|tests fail| green
    green -->|tests pass| refactor
    refactor -->|next feature| red
```

Here is how I set up each agent and why the separation matters.
Red agent: the test writer
The Red agent has one job. Write tests that fail. That sounds simple, but getting it right is the foundation of everything else.
```markdown
---
name: Red
description: Write failing tests that define feature behavior
tools: ['read', 'search', 'edit', 'terminalCommand']
user-invocable: false
---
You are a test specification writer. Your job:
1. Read the feature request carefully
2. Search existing tests for patterns and conventions
3. Write new tests that define the expected behavior
4. Run the tests and verify they FAIL
Rules:
- Never write implementation code
- Tests must follow existing project conventions
- Each test should test one specific behavior
- Use descriptive test names that explain the expected behavior
- After writing tests, run them and confirm they fail
```
The critical constraint is terminalCommand. The Red agent must be able to run the tests. Writing a test that you think will fail is not the same as writing a test that actually fails. I learned this the hard way when the Red agent wrote a test for a feature that already existed. The test passed immediately. That is not red, that is a wasted cycle.
By giving the Red agent terminal access to run tests, it verifies its own work. The test must fail before we move forward.
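To make this concrete, here is the shape of a Red-phase output. This is a hedged sketch, not the agent's literal output: Jest-style TypeScript against a hypothetical slugify module that does not exist yet, which is exactly why the run goes red.

```typescript
// Red-phase sketch (hypothetical). The module under test does not exist
// yet, so the import itself fails -- a valid red, though an assertion
// failure against a stub is an even better one.
import { slugify } from "../src/slugify";

describe("slugify", () => {
  it("lowercases words and joins them with hyphens", () => {
    expect(slugify("Hello World")).toBe("hello-world");
  });

  it("strips characters that are not URL-safe", () => {
    expect(slugify("Hello, World!")).toBe("hello-world");
  });
});
```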
Green agent: the minimalist coder
The Green agent receives the failing tests and writes the minimum code to make them pass. Nothing more. No optimization. No cleanup. Just make the red go green.
```markdown
---
name: Green
description: Write minimal code to pass failing tests
tools: ['read', 'edit', 'terminalCommand']
user-invocable: false
model: ['Claude Haiku 4.5 (copilot)', 'GPT-5-mini (copilot)']
---
You are a minimalist implementer. Your job:
1. Read the failing tests
2. Understand what behavior they expect
3. Write the smallest amount of code that makes all tests pass
4. Run the tests to verify they pass
Rules:
- Do NOT optimize or refactor
- Do NOT add features beyond what tests require
- Hard-coded returns are acceptable if they pass the test
- If one approach fails, try a simpler one
- Ask for help if tests cannot be passed with minimal changes
```
I use a cheaper, faster model for the Green agent. Haiku-class or mini models work fine here because the task is constrained. The tests tell the agent exactly what to do. There is no ambiguity. The cheaper model also responds faster, which keeps the Red-Green loop tight.
The “hard-coded returns are acceptable” instruction is important. In classic TDD, Kent Beck teaches you to first make the test pass with a literal value, then generalize. AI agents hate this. They want to write the “real” implementation immediately. Explicitly telling them that hard-coded values are fine forces the baby-step discipline that makes TDD work.
Does it always take baby steps? No. About half the time it still jumps ahead to a full implementation. But when it does take a baby step, the next iteration is cleaner because the tests drive the direction.
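Here is what the baby-step progression looks like when it works, as a sketch with hypothetical names rather than a transcript of a real session:

```typescript
// Pass 1: the only failing test is expect(add(2, 2)).toBe(4).
// A literal is the minimal change that goes green.
export function add(a: number, b: number): number {
  return 4; // hard-coded on purpose; the single existing test passes
}

// Pass 2: the Red agent adds expect(add(3, 5)).toBe(8). Now the literal
// fails, and the minimal change is the real implementation:
//   return a + b;
```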
Refactor agent: the quality gate
After the tests pass, the Refactor agent improves the code without changing behavior. It gets the full model because quality decisions need deeper reasoning.
```markdown
---
name: Refactor
description: Improve code quality without breaking tests
tools: ['read', 'search', 'edit', 'terminalCommand']
user-invocable: false
---
You are a refactoring specialist. Your job:
1. Read the current implementation and tests
2. Search for patterns in the rest of the codebase
3. Identify improvements: duplication, naming, structure, performance
4. Apply changes one at a time
5. Run tests after each change to verify nothing breaks
Rules:
- All tests must pass after each change
- Do NOT add new features or change behavior
- Match existing codebase conventions
- If a refactor breaks tests, revert it immediately
- Explain each refactoring decision briefly
```
The “one change at a time” rule is key. I tried letting the Refactor agent make multiple changes at once. It would rename a function, extract a method, and change a data structure all in one pass. When a test broke, nobody knew which change caused it. By requiring incremental changes with test runs between each, failures are easy to diagnose.
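As an illustration of what one change at a time means in practice, here is the kind of single extraction I expect per pass. Names are hypothetical; the point is that a pass makes exactly one change and then runs the tests:

```typescript
interface LineItem { price: number; qty: number; }

// Before: this reduce appeared verbatim in two request handlers.
// After: one named helper, introduced in a single pass, followed by a
// full test run before the agent touches anything else.
export function subtotal(items: LineItem[]): number {
  return items.reduce((sum, item) => sum + item.price * item.qty, 0);
}
```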
The TDD coordinator
The coordinator ties the three agents together. It never writes code itself. It manages the flow and checks that each phase completed correctly.
```markdown
---
name: TDD
description: Test-driven development with Red-Green-Refactor
tools: ['agent', 'read', 'search', 'terminalCommand']
agents: ['Red', 'Green', 'Refactor']
---
You coordinate test-driven development. For each feature:
1. Use the Red agent to write failing tests for the feature
2. Verify the Red agent's tests actually fail by running them
3. Use the Green agent to make the tests pass
4. Verify all tests pass by running them
5. Use the Refactor agent to improve the code
6. Verify all tests still pass
If any verification step fails, investigate and retry. Report progress after each phase.
```
Notice the coordinator also has terminalCommand. It independently verifies what each agent did. Trust but verify. The Red agent says the tests fail? The coordinator runs them to confirm. The Green agent says they pass? The coordinator checks. This double-verification catches cases where an agent reports success but something actually went wrong.
What I learned from two weeks of real usage
Here is the practical truth, not the theory.
Tests as a contract actually works
The biggest win is that tests prevent drift. In my multi-agent post, I mentioned that agents sometimes ignore their own plans. With TDD, the plan is not a markdown document that the agent can reinterpret. The plan is executable tests. Either the code passes them or it does not. There is no room for creative reinterpretation.
I had a case where the Green agent decided my test was “incorrect” and tried to modify it instead of writing the implementation. The tool restrictions stopped it, because the Green agent only gets edit access to implementation files, not test files. When your Red agent writes to tests/ and your Green agent can only write to src/, you get a clean separation that prevents this kind of drift.
How did I scope file access? Custom .instructions.md files per agent that explicitly state which directories are in scope. It is not perfect enforcement, but agents follow it most of the time.
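For reference, here is one hypothetical shape such a file can take. The applyTo frontmatter field is how VS Code scopes an instructions file to matching paths; the directory rule in the body is advisory, which is why the enforcement is imperfect:

```markdown
---
applyTo: "src/**"
---
You are the Green agent. Only create and edit files under src/.
Never modify anything under tests/. If a test looks wrong, stop and
report it instead of changing it.
```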
Cheap models shine in the Green phase
For the Green phase, a smaller model is faster and cheaper. The constraints are so tight (the tests tell it exactly what to do) that a fast model produces the same result as a premium one. I saved roughly 40% of my premium requests by using a mini model for Green.
The Red and Refactor phases are different. Writing good tests requires understanding the feature domain. Refactoring requires understanding code patterns and making judgment calls. Those need the full model.
The baby-step problem
AI agents struggle with baby steps. Kent Beck’s classic TDD says: for a function that adds two numbers, first return a hard-coded value, then generalize. AI agents see this as stupid. They know the answer. Why return a hard-coded 4 when they can write return a + b?
Most of the time, jumping ahead works fine. But for complex features, it causes problems. The agent writes a complete implementation that passes the first test but is wrong in subtle ways that later tests reveal. Then it has to rework a large chunk of code instead of building up incrementally.
I do not have a perfect solution for this. Explicitly instructing baby steps helps about half the time. What works better is writing more tests in the Red phase, smaller and more granular, so the Green agent cannot jump too far ahead.
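Granularity looks like this in practice. A sketch with a hypothetical parseConfig helper: instead of one broad test, the Red phase emits several tiny ones, and each forces only a small step:

```typescript
import { parseConfig } from "../src/config"; // hypothetical module

it("returns an empty object for an empty string", () => {
  expect(parseConfig("")).toEqual({});
});

it("parses a single key=value line", () => {
  expect(parseConfig("port=8080")).toEqual({ port: "8080" });
});

it("ignores comment lines starting with #", () => {
  expect(parseConfig("# note\nport=8080")).toEqual({ port: "8080" });
});
```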
Refactor is where agents add the most value
Honestly, I was not expecting this. The Refactor phase is where AI agents consistently impressed me. They spot duplication that I miss. They suggest extractions that make the code cleaner. They notice when a variable name does not match the concept it represents.
The key is that the tests provide a safety net. Without tests, refactoring with AI is terrifying. You make changes and cannot be sure nothing broke. With tests, the agent can be bold. Rename things. Extract methods. Restructure data flows. If the tests pass, the refactoring is safe. If they fail, revert.
This is the same reason refactoring works well for human developers with TDD. The safety net is the same. The agent just moves faster.
The Red phase is where YOU add the most value
Writing tests is where human judgment matters most. Tests encode business requirements. They define edge cases. They express what the system should do in words that map to stakeholder expectations.
AI agents can generate tests, but they generate generic ones. “Should return a value.” “Should not throw an error.” These test the obvious paths. The interesting tests, the ones that catch real bugs, come from understanding the business domain. “Should reject an order when inventory is reserved but not committed.” “Should handle the race condition between payment confirmation and stock update.”
Write those tests yourself. Or at least, write the test names and let the Red agent fill in the implementation. The names define the specification. That is where your expertise matters.
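In Jest-style runners, it.todo is a convenient way to do this: the names register as pending specs, and the Red agent's job becomes filling in the bodies. The scenarios here are illustrative:

```typescript
describe("order submission", () => {
  // Human-written specification; bodies left for the Red agent.
  it.todo("rejects an order when inventory is reserved but not committed");
  it.todo("handles payment confirmation racing the stock update");
  it.todo("releases the reservation when payment ultimately fails");
});
```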
Comparing approaches from the wild
Several people and projects have explored TDD with AI. Here is what resonates with my experience.
The Thoughtworks team under Paul Sobocinski found that Copilot performs better when working with a high-quality TDD test suite. Better inputs lead to better outputs. My multi-agent setup reinforces this: the Red agent’s tests are the input for the Green agent, so test quality directly determines implementation quality.
Harper Reed described a workflow where you brainstorm a spec, then plan with a TDD-focused prompt, then execute step by step. His approach is sequential and single-agent. The multi-agent version gives each step its own specialist. Same idea, better separation.
Kief Morris at Thoughtworks wrote about being “on the loop” rather than “in” it. TDD agents are a practical implementation of this. You define the what (tests). The agents handle the how (implementation). You verify the result (tests pass). You are on the loop, steering through specifications rather than writing every line.
The Spec-Driven Development movement, analyzed by Birgitta Böckeler, pushes specs even further: write structured requirements documents and let agents generate code from them. My take is that executable tests are better specs than markdown documents. A spec can be misinterpreted. A test either passes or fails. TDD is spec-driven development where the spec is executable.
GitHub’s own cloud agent supports custom agents and skills. You could run the TDD cycle in a cloud session, where each Red-Green-Refactor round happens autonomously. I have not pushed it that far yet. The feedback loop for cloud sessions is slower, and I prefer tighter cycles where I can review test results between rounds.
A practical example
Let me walk through one real cycle. I needed to add a rate limiter to an API endpoint.
Red phase. I described the feature: “Rate limit requests to 100 per minute per API key. Return 429 when exceeded. Include retry-after header.” The Red agent searched my existing test files, found the testing patterns I use, and wrote five tests:
- Should allow requests under the limit
- Should reject the 101st request with 429 status
- Should include retry-after header when rate limited
- Should reset the counter after one minute
- Should track limits independently per API key
All five failed. Good.
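Here is roughly what the second test looked like, reconstructed as a sketch. The createLimiter API and its result shape are illustrative, not my exact code:

```typescript
import { createLimiter } from "../src/rateLimiter"; // hypothetical module

it("rejects the 101st request within a minute with 429", () => {
  const limiter = createLimiter({ limit: 100, windowMs: 60_000 });
  for (let i = 0; i < 100; i++) {
    expect(limiter.check("key-a").allowed).toBe(true);
  }
  const result = limiter.check("key-a");
  expect(result.allowed).toBe(false);
  expect(result.status).toBe(429);
  expect(result.retryAfterSeconds).toBeGreaterThan(0);
});
```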
Green phase. The Green agent read the tests and wrote an in-memory rate limiter. Simple dictionary with timestamps. Nothing fancy. All five tests passed.
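The implementation was along these lines, sketched against the hypothetical API above: a plain map of timestamps per key, enough to go green and nothing more.

```typescript
type CheckResult = { allowed: boolean; status?: number; retryAfterSeconds?: number };

export function createLimiter({ limit, windowMs }: { limit: number; windowMs: number }) {
  const hits = new Map<string, number[]>(); // per-key request timestamps

  return {
    check(key: string): CheckResult {
      const now = Date.now();
      // Keep only timestamps inside the sliding window.
      const recent = (hits.get(key) ?? []).filter((t) => now - t < windowMs);
      if (recent.length >= limit) {
        hits.set(key, recent);
        // The oldest surviving hit determines when capacity frees up.
        const retryAfterSeconds = Math.ceil((recent[0] + windowMs - now) / 1000);
        return { allowed: false, status: 429, retryAfterSeconds };
      }
      recent.push(now);
      hits.set(key, recent);
      return { allowed: true };
    },
  };
}
```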
Refactor phase. The Refactor agent noticed that the rate limit values were hard-coded. It extracted them to configuration. It also noticed the in-memory store would not work with multiple server instances and added a comment suggesting a Redis-backed store for production. It did not implement Redis, just flagged the concern. All tests still passed.
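The extraction, roughly, and the flagged concern. Again a sketch with illustrative names:

```typescript
// Literals moved behind a config object so deployments can tune them.
export const rateLimitConfig = {
  limit: Number(process.env.RATE_LIMIT ?? "100"),
  windowMs: Number(process.env.RATE_LIMIT_WINDOW_MS ?? "60000"),
};

// Flagged by the Refactor agent, deliberately not implemented:
// the in-memory Map does not share state across server instances;
// consider a Redis-backed store before running behind a load balancer.
```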
Total time: about twelve minutes. The equivalent without TDD agents, just asking a single agent to “add rate limiting,” tends to produce a more complete but harder-to-verify result. I get code that looks right but takes longer to trust.
When TDD agents do not work well
I have to be honest about the limits.
Exploratory work. When you do not know what you are building, you cannot write tests first. TDD assumes you can define the expected behavior before implementing it. If you are prototyping, spike first, then TDD the thing you want to keep.
UI-heavy features. Visual behavior is hard to express in unit tests. You can test component logic, but testing whether something “looks right” requires different tools. TDD agents work best for backend logic, algorithms, APIs, and data transformations.
Tightly coupled systems. When a feature touches many parts of the codebase, the Red agent struggles to write focused tests. It writes integration tests that depend on the full system, which are slow and fragile. For these cases, write the test structure yourself and let the agents fill in details.
Very small changes. For a one-line bug fix, spinning up three agents is overkill. Just fix it. TDD agents earn their cost on medium-to-large features where the implementation has enough complexity to benefit from step-by-step discipline.
Getting started this week
If you want to try TDD agents without committing to a big setup, start simple.
Create just two agents: Red and Green. Skip the Refactor agent for now. Put them in .github/agents/. Use the configurations from this post as a starting point.
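The layout is just two files. Exact file naming varies with your Copilot and VS Code versions, so treat this as a sketch:

```
.github/
  agents/
    Red.md    # the Red agent configuration from this post
    Green.md  # the Green agent configuration from this post
```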
Pick a small feature you need to build. Ask the Red agent to write tests. Review them. Then ask the Green agent to make them pass. Review the code.
Notice the rhythm. Notice how the tests constrain the implementation. If it feels useful, add the Refactor agent later and the coordinator after that.
The real insight is not about AI agents specifically. It is that the discipline of TDD, defining behavior before implementing it, remains one of the most effective ways to produce correct software. AI just makes each step faster without removing the need for the discipline.
References
- TDD with GitHub Copilot by Paul Sobocinski (Thoughtworks)
- Humans and Agents in Software Engineering Loops by Kief Morris (Thoughtworks)
- Understanding Spec-Driven-Development by Birgitta Böckeler (Thoughtworks)
- My LLM codegen workflow atm by Harper Reed
- About GitHub Copilot cloud agent
- Custom agents in VS Code
- My earlier post on multi-agent coordination