The Verifier – The E2E Test Agent

Why a green build doesn't mean a feature works – and how a fifth, specialized agent turns the use case into an executable end-to-end test, closing the loop back to the original requirement.

06/08/2026

10 min read

A note on this content

This series describes my insights from using AI in the software development process. These are personal experiences and conclusions from which I derived measures that worked for my projects. This series does not claim to be a comprehensive guide or universally valid – it's meant as inspiration for your own projects.

In the previous post I introduced the realization-plan agent and the developer agent – the makers on my team. At the end of that post I announced that the four-agent team would be rounded out by a few specialized additions. This post delivers on that promise – with an addition that, for me, isn't optional but the last missing link in the chain: the E2E test agent.

The four agents from the last posts have a deliberate gap. They plan, design, decompose, and implement, but none of them automatically verifies at the end whether the finished feature actually does, in a real browser or simulator, what the use case described. That's exactly the gap the fifth agent closes. Initially I had the developer agent write unit tests at the end of each feature, but that didn't give me the results I was looking for.

The complete team – the E2E test agent as the final quality gate after the developer.

The problem after the code: green isn't the same as done

The developer agent ends every step with a validation command – npm run build, a type check, maybe a linter. If the command is green, the step counts as complete. That's good and necessary. But it proves exactly one thing: the code compiles. It does not prove that the feature works.

A green build says nothing about whether the search bar actually shows results as you type, whether the dropdown closes again when you clear the field, whether the correct message appears on zero hits – or in a mobile app, whether a tap navigates to the right screen and whether a network error shows the correct error message. Those are statements about behavior – and behavior can't be checked by a compiler, only by actually operating the application.

In my old workflow I did exactly that by hand: start the dev server, click, type, look. For a single feature that's doable. The problem isn't the first run – it's the tenth. Every new feature can break an old one, and nobody clicks through the entire existing feature set again after every change. This is precisely where the regressions accumulate that you only notice in production.

The human-in-the-loop checkpoints from the previous posts don't catch this. As I described in post 12 and post 13, they're deliberately aimed at direction – requirement, concept, plan. They make sure the right thing gets built. They don't make sure the built thing keeps working. That's a different question, and it needs a different tool.

The E2E test agent: the use case becomes executable

The E2E test agent runs after the developer agent has completed the last step of the plan. Its job: write an end-to-end test that exercises the freshly implemented feature in a real browser or simulator, run it, and report the result.

The crucial part is where it gets its test cases. It doesn't invent them. It reads the use-case document from docs/usecases/, the one created at the very beginning by the business analyst and approved by me – and translates its structure directly into test scenarios:

  • The main flow (happy path) becomes the central test scenario.
  • Each alternative flow becomes its own test case.
  • Each error case becomes a test that forces and checks the defined error behavior.
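
The mapping above can be sketched in a few lines of TypeScript. This is only an illustration under assumptions: the `UseCase` shape, its field names, the `deriveScenarios` helper, and the search example data are hypothetical and not taken from my templates. It shows one scenario for the main flow, one per alternative flow, and one per error case.

```typescript
// Hypothetical shape of a parsed use-case document (field names are assumptions).
interface UseCase {
  title: string;
  mainFlow: string[];
  alternativeFlows: { name: string; steps: string[] }[];
  errorCases: { name: string; trigger: string; expectedBehavior: string }[];
}

interface TestScenario {
  kind: "main" | "alternative" | "error";
  name: string;
  steps: string[];
  expected?: string; // only set for error cases: the behavior the use case defines
}

// Translate the structure of the use case one-to-one into test scenarios.
function deriveScenarios(useCase: UseCase): TestScenario[] {
  const main: TestScenario = {
    kind: "main",
    name: `${useCase.title}: main flow`,
    steps: useCase.mainFlow,
  };
  const alternatives = useCase.alternativeFlows.map((flow): TestScenario => ({
    kind: "alternative",
    name: `${useCase.title}: ${flow.name}`,
    steps: flow.steps,
  }));
  const errors = useCase.errorCases.map((errorCase): TestScenario => ({
    kind: "error",
    name: `${useCase.title}: ${errorCase.name}`,
    steps: [errorCase.trigger], // the test has to force the error
    expected: errorCase.expectedBehavior,
  }));
  return [main, ...alternatives, ...errors];
}

// Illustrative example data, loosely based on the search feature in this post.
const searchUseCase: UseCase = {
  title: "Search",
  mainFlow: ["type a query into the search bar", "see matching results in the dropdown"],
  alternativeFlows: [
    { name: "clear field", steps: ["type a query", "clear the field", "dropdown closes"] },
    { name: "zero hits", steps: ["type a query without matches", "see the zero-hits message"] },
  ],
  errorCases: [
    { name: "network error", trigger: "make the search request fail", expectedBehavior: "error message is shown" },
  ],
};

const scenarios = deriveScenarios(searchUseCase);
console.log(scenarios.map((s) => `${s.kind}: ${s.name}`));
```

The point of the sketch is that the number and kind of tests follow from the document: if the use case lists two alternative flows and one error case, exactly four scenarios come out.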

This closes a loop that holds the whole series together. The use case was the single source of truth for the requirement from the start. The solution architect turned it into a concept, the realization-plan agent into a battle plan, the developer into code. And now the same document becomes the yardstick. The requirement I confirmed at the beginning is exactly what gets tested against at the end – not an interpretation of it that the test agent improvises along the way.

The loop closes: the use case's paths become executable test cases.

In my Nuxt stack the tool of choice is Playwright – the agent starts the dev server, drives a real browser, types into the search bar, waits for the dropdown, and checks the hits. In my Flutter project a Flutter integration test takes over: the agent starts the simulator, navigates through the app, performs gestures, and checks whether the widgets show the expected state. The framework changes – the principle doesn't: the test operates the application the way a user would, and compares the actual behavior with what the use case promised.
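
To make "the framework changes, the principle doesn't" concrete, here is a minimal sketch that deliberately uses neither Playwright nor Flutter. Everything in it is an assumption for illustration: the `AppDriver` interface, the `FakeSearchApp` stand-in, and the `runSearchMainFlow` helper are hypothetical, and the interface is synchronous only to keep the example short (real browser and simulator drivers are asynchronous). It shows the test operating the application through a driver and comparing actual with expected behavior.

```typescript
// Hypothetical abstraction over "something that operates the app like a user".
// A real implementation would wrap a browser or simulator driver.
interface AppDriver {
  typeInto(fieldId: string, text: string): void;
  readTexts(regionId: string): string[];
}

// In-memory stand-in for the running application, used only for this sketch.
class FakeSearchApp implements AppDriver {
  private query = "";
  private catalog: string[];

  constructor(catalog: string[]) {
    this.catalog = catalog;
  }

  typeInto(fieldId: string, text: string): void {
    if (fieldId !== "search") throw new Error(`unknown field: ${fieldId}`);
    this.query = text;
  }

  readTexts(regionId: string): string[] {
    if (regionId !== "results") throw new Error(`unknown region: ${regionId}`);
    if (this.query === "") return []; // empty field: dropdown is closed
    const needle = this.query.toLowerCase();
    return this.catalog.filter((item) => item.toLowerCase().includes(needle));
  }
}

interface ScenarioResult {
  passed: boolean;
  detail: string;
}

// The scenario only knows the driver interface, not the framework behind it.
function runSearchMainFlow(app: AppDriver, query: string, expected: string[]): ScenarioResult {
  app.typeInto("search", query);
  const actual = app.readTexts("results");
  const passed =
    actual.length === expected.length && actual.every((text, i) => text === expected[i]);
  return {
    passed,
    detail: passed
      ? "results match the use case"
      : `expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}`,
  };
}

const app = new FakeSearchApp(["Nuxt guide", "Flutter guide", "Playwright notes"]);
const mainFlowResult = runSearchMainFlow(app, "guide", ["Nuxt guide", "Flutter guide"]);
console.log(mainFlowResult.detail);
```

In a real setup the fake would be replaced by the framework's own driver; the comparison of actual against expected behavior stays the same.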

Template to take away: E2E test agent

I've published the agent prompts as templates on GitHub: testing-agent-playwright and testing-agent-flutter. As with the other agents, they're a starting point, not a finished setup: the test framework (Playwright or Cypress for web, Flutter integration tests for mobile), the directory conventions (tests/e2e/), the selector strategy, and the way the use case is mapped onto test cases all need to be adapted to your stack. An overview of all templates is in the README.

Why the developer agent doesn't write the tests

The obvious question: if the developer agent built the feature – why doesn't it just write the tests too? After all, it knows the code best.

That's exactly the problem. It knows the code too well.

An agent testing its own implementation tends to test what it built, not what was required. The developer agent never sees the requirement in the form of the use case; it only gets the realization plan and the solution concept, and otherwise works from its freshly implemented code. If it made an implicit assumption while implementing, the same assumption flows into the test. The test goes green because it mirrors the reality of the code, not the expectation of the requirement. It's classic confirmation bias, cast in software: the examinee writes its own exam question.

I noticed this particularly with unit tests. I often had many unit tests and high coverage, yet the feature still didn't work, because the tests weren't checking what actually mattered. That frustrated me greatly, because I thought I had done everything to make sure it worked.

That's why the E2E test agent deliberately gets a different context. It doesn't see the plan or the implementation details. It sees the use case – the requirement in its original, human-confirmed form – and the running application. The real value comes from that separation: it checks against the expectation, not against the result. When it finds a discrepancy, that's a genuine signal – and not the echo of an assumption that was already baked into the code.
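
One way to picture this separation is as an allowlist per agent. The following is a sketch under assumptions, not how my templates are wired: the artifact names, the `contextByAgent` table, the `buildContext` helper, and the file name `search.md` are hypothetical. It shows the E2E test agent receiving only the use case and the running application, even when everything else is available.

```typescript
type Artifact = "useCase" | "solutionConcept" | "realizationPlan" | "sourceCode" | "runningApp";
type Agent = "developer" | "e2eTester";

// Hypothetical allowlist: which artifacts each agent is allowed to see.
const contextByAgent: Record<Agent, readonly Artifact[]> = {
  developer: ["solutionConcept", "realizationPlan", "sourceCode"],
  e2eTester: ["useCase", "runningApp"],
};

// Build the context for an agent by filtering everything that is available.
function buildContext(
  agent: Agent,
  available: Partial<Record<Artifact, string>>,
): Partial<Record<Artifact, string>> {
  const context: Partial<Record<Artifact, string>> = {};
  for (const artifact of contextByAgent[agent]) {
    const value = available[artifact];
    if (value !== undefined) context[artifact] = value;
  }
  return context;
}

// Illustrative values only; the paths and URL are placeholders.
const everything: Partial<Record<Artifact, string>> = {
  useCase: "docs/usecases/search.md",
  solutionConcept: "solution concept document",
  realizationPlan: "realization plan document",
  sourceCode: "implementation files",
  runningApp: "http://localhost:3000",
};

const e2eContext = buildContext("e2eTester", everything);
const developerContext = buildContext("developer", everything);
console.log(Object.keys(e2eContext)); // ["useCase", "runningApp"]
```

The filter is trivial on purpose. What matters is that the plan and the implementation details never reach the test agent, so they cannot leak into its expectations.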

It's the same principle that carries the whole series: each agent gets exactly the context its task needs – and deliberately not more. Focus through constraint.

Human-in-the-loop: I review the test, not just the result

This is the most subtle but most important point about this agent. A green test run is seductive. It feels like safety. But a green test that checks nothing meaningful is more dangerous than no test at all – because it fakes a safety that doesn't exist.

In post 12 I described the "fake control": a human-in-the-loop that nominally exists but in practice just waves things through. With tests this trap has its own shape. A test can be green because the feature works – or because the test asserts nothing solid. A click without a subsequent check. An assertion on an element that's always there. An expect that can never fail.

That's why my checkpoint here is not on the green checkmark, but on the test code itself. I read what the agent checks. Does the test really reproduce the main flow from the use case? Does it check the error case by actually forcing the error – or does it only pretend to? Are the assertions specific enough that they'd fail if the feature broke?
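
The last question can be made tangible with a tiny experiment: run the same check against a working and a deliberately broken implementation. The sketch below is illustrative only; `workingSearch`, `brokenSearch`, and both check functions are hypothetical stand-ins, not code from a real test suite. It shows a vacuous check that stays green either way and a specific check that fails once the feature breaks.

```typescript
type SearchFn = (query: string) => string[];

const catalog = ["Nuxt guide", "Flutter guide", "Playwright notes"];

// Stand-in for the feature as specified.
const workingSearch: SearchFn = (query) =>
  catalog.filter((item) => item.toLowerCase().includes(query.toLowerCase()));

// Stand-in for a regression: the search never returns any hits.
const brokenSearch: SearchFn = () => [];

// Vacuous check: asserts something that is true for any implementation.
function vacuousCheck(search: SearchFn): boolean {
  const hits = search("guide");
  return Array.isArray(hits);
}

// Specific check: asserts the exact hits the use case promises.
function specificCheck(search: SearchFn): boolean {
  const hits = search("guide");
  return hits.length === 2 && hits[0] === "Nuxt guide" && hits[1] === "Flutter guide";
}

const report = {
  vacuousOnWorking: vacuousCheck(workingSearch), // true
  vacuousOnBroken: vacuousCheck(brokenSearch), // true: the test cannot fail
  specificOnWorking: specificCheck(workingSearch), // true
  specificOnBroken: specificCheck(brokenSearch), // false: the regression is caught
};
console.log(report);
```

When I read the agent's test code, this is the distinction I look for: would the assertion still pass if the feature returned nothing at all?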

This is a compact review that's easy to judge: a handful of test cases, in my language, against a document I approved myself. Exactly the kind of checkpoint I defend throughout this series: not the tired waving-through of a thousand lines, but a targeted look at the right level of abstraction, at the right moment.

And the test stays. Unlike the manual click-through that evaporates when you close the browser, the E2E test moves into the repository and runs again on every future change. A one-time "it works" becomes a lasting "it still works." That's where the investment pays off: the test the agent writes for search today protects search in six months from a change nobody is thinking about today.

Yes, this has its limits too

An E2E test agent is no substitute for a well-thought-out testing strategy. E2E tests are slow, they can be flaky, and they deliberately cover only the paths described in the use case, not every conceivable edge case. For fine-grained logic, unit tests remain the better tool, and those belong in the developer agent's remit, written alongside the function they cover.

The E2E test agent checks the feature from the user's bird's-eye view – exactly at the level the use case is formulated. That's its strength and its limit at the same time. It tells me reliably: "The flow you ordered at the beginning works in a real browser." It doesn't tell me: "Every single helper function is correct in isolation." Both are needed. This agent delivers the first.

Conclusion

The E2E test agent is the point where the loop closes. What began as a vague wish and was cast into a use case by the business analyst returns at the end as the yardstick. The very requirement that started the whole process decides whether it succeeded.

With that, the team is complete: the business analyst and the solution architect make sure the right thing gets planned. The realization-plan agent makes the plan executable, the developer implements it. And the E2E test agent proves that the result actually does what was required at the start – traceable, repeatable, and permanently anchored in the repository.

Enough theory. In the last posts I explained why each of these agents exists and which problem it solves. In the next post I'll raise the curtain and show the entire workflow in action: from a real vague wish to a finished, tested commit, with all the handoffs, checkpoints, and corrections that were necessary along the way.