The End of the "One-Prompt Wonder" – Why We Need Agent Teams

Why simple prompting and "vibe coding" fail on complex projects – and the four systemic problems that forced me to rethink my entire development workflow.

05/18/2026

•13 min read

Part 1 of 9 of the "Agentic Coding" series

Note on the Content

This post series describes my insights from using AI in the software development process. These are personal experiences and findings from which I derived measures that worked for my projects. This series does not claim to be a comprehensive guide or universally applicable, but rather an inspiration for your own projects.

This is part one of a nine-part series. You can find all parts, in order, on the Agentic Coding overview page.

If you've been following my previous posts, you know the journey: starting with simple chat prompts on claude.ai, moving on to agentic coding with GitHub Copilot, and then introducing copilot-instructions.md and AGENTS.md to finally give the assistant clear rules of engagement. Each of those steps improved my workflow – and at the same time exposed new, deeper problems.

At some point I realized that the real problem couldn't be solved with better instructions. It's systemic in nature. In this post I want to describe four core problems I kept running into, and explain why they forced me to rethink my entire approach.

The Promise of "Vibe Coding"

The term "vibe coding" was coined by Andrej Karpathy and describes a development approach where you describe a vague idea to the model and the AI handles the rest. You work on an intuitive, fast level – no architecture plan, no spec document, just prompting and seeing what emerges. At first it feels like a superpower.

For small scripts, quick prototypes, or isolated helper functions, this works great. The problems start when the codebase grows, features build on each other, and the AI suddenly no longer knows what it built moments ago.

My experience after months of active use: vibe coding is a great accelerator for getting started, but a poor foundation for everything that follows.

I'm a trained computer scientist and software engineer, which naturally brings a software engineering mindset. I want to continue that engineering approach even with AI assistance. "Vibing" while coding is interesting, but it shouldn't be the goal. I want to build a sustainable, maintainable codebase that meets the demands of my projects. And for that, just "vibing" isn't enough – I need structure, planning, and clear responsibilities.

I kept asking myself: how can people who supposedly have no software engineering experience use this approach to build projects that real users actually use and pay for, while I, with a decade of experience in software development, can't get the AI agent to generate clean, non-duplicated code? I was close to despair. How can this be? Were my years of study pointless?

I can reassure all computer scientists: not everything that glitters is gold – and indeed, the developer community experiences the same pain I went through. Anyone reading the relevant subreddits finds a consistent picture:

The "Month 3 Wall": Many projects run smoothly at first because the AI optimizes locally for each prompt. After about three months the system collapses: there is no global architecture, and the app is held together by duct tape because nobody defined the overarching structure. Each piece of code works in isolation, but the system as a whole has long since diverged from the original design. (r/vibecoding)
State Drift and Code Duplication: As features grow, the model loses track of existing state. Instead of refactoring existing logic, it simply duplicates it – tables, functions, and classes get recreated even though they already exist. A frequently cited piece of advice: "Freeze your code early," before the duplication gets out of hand. (r/vibecoding)
Exploding Token Costs from Dead Code: Vibe coding quickly produces unused code – orphaned types, duplicate functions, helper methods that are never called. This ballast clogs the context window in subsequent prompts. The model needs more and more tokens while output quality declines at the same time: a downward spiral that creeps up slowly. (r/cursor)
Security Debt on Autopilot: When moving fast, the AI tends to skip standard security practices unless told otherwise. Authentication interfaces are wired up sloppily, environment variables land in the wrong places, validations are missing. What initially looks like a finished product is quietly accumulating security debt in the background. (r/nocode)
The Illusion of the Finished v0 Prototype: The AI quickly delivers a UI that looks good – creating the illusion that you're almost done. The real, expensive problems lie in the 99% of invisible code underneath: scalability, edge cases, data compliance. Without a strict plan, these areas remain broken. Debugging in production becomes a nightmare. (r/vibecoding)

This is no coincidence and no streak of bad luck for individual developers. It's a pattern that emerges from unstructured use of AI coding tools. It shows that vibe coding can be a powerful technique, but it's not the right approach for complex, long-term projects. Even AI needs a structured process right now. Especially for complex tasks, it's important to guide the AI through the problem. The principle at work here is ancient: divide and conquer – or in Latin: Divide et impera.

Historical Background: Divide et impera – Divide and Conquer

Philip II of Macedon used it in the 4th century BC to play the Greek city-states against each other instead of fighting a united front. The Romans elevated it to a foreign policy maxim: they forged unequal alliances with individual tribes to prevent a common front. Julius Caesar applied the same principle with great success in the Gallic Wars. And Sun Tzu already described a very similar splitting strategy around 500 BC in The Art of War: whoever approaches a complex challenge as a whole loses – whoever breaks it into manageable parts wins.

What antiquity recognized for warfare and politics entered computer science in 1945, when John von Neumann wrote a merge sort routine for the EDVAC. It is widely regarded as one of the first formal divide-and-conquer algorithms: large sorting tasks are split into manageable halves, which suits a machine with very limited memory. In the decades that followed came Quicksort, the Karatsuba algorithm, and the Fast Fourier Transform – all built on the same insight: whoever attacks a problem as one undivided whole loses. Whoever breaks it apart wins.
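
For readers who have not seen it in a while, here is a minimal sketch of merge sort in Python. It is a textbook version for illustration, not a reconstruction of von Neumann's original program; the function names are my own.

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One side is exhausted; the rest of the other side is already sorted.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Sort a list by splitting it, sorting each half, and merging the halves."""
    if len(items) <= 1:  # base case: nothing left to divide
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid])   # divide: each half is solved on its own
    right = merge_sort(items[mid:])
    return merge(left, right)        # conquer: combine the solved halves
```

The point for this series is the shape of the solution: each recursive call only ever has to deal with its own small slice of the problem.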

The pattern is consistent: whenever a problem became too large for a single pass, decomposing it into focused sub-problems was the decisive breakthrough idea. And for the use of AI in particular, this is truer than ever. An LLM has a limited context window. Every token that lands in it consumes part of the model's available attention. Whoever packs a complex task – requirements analysis, architecture decisions, and code implementation – into a single prompt overloads that window. The result is a model that has to focus on too many things at once and loses quality on every sub-problem. Small, focused tasks, by contrast, let the model work in its element: precise, consistent, reliable.

A single prompt mixes three responsibilities – decomposition separates them cleanly.
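
To make the context argument a bit more tangible, here is a rough Python sketch that compares how much text a model has to hold at once in a single combined prompt versus three focused ones. The "four characters per token" rule is only a common rule of thumb, and the placeholder texts and their lengths are made up for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate, assuming about four characters per token."""
    return max(1, len(text) // 4)


def largest_context(prompts: list[list[str]]) -> int:
    """Return the token estimate of the biggest single prompt.

    Each prompt is given as a list of text parts that are sent together.
    """
    return max(sum(estimate_tokens(part) for part in prompt) for prompt in prompts)


# Placeholder texts standing in for real prompt parts (lengths are made up).
requirements = "r" * 4000
architecture = "a" * 8000
code_task = "c" * 2000

one_prompt = [[requirements, architecture, code_task]]
three_prompts = [[requirements], [architecture], [code_task]]

print(largest_context(one_prompt))     # all three parts compete in one window
print(largest_context(three_prompts))  # each step only carries its own part
```

In a real workflow each step also receives the result of the previous one, so the saving is smaller than in this toy comparison. The direction stays the same: no single step has to carry everything.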

The Four Problems That Forced Me to Change Course

1. The Concept Problem: The AI Invents Instead of Understanding

The first and most fundamental problem is barely noticeable at the start because you mistake it for a strength. You describe a feature and the AI builds it immediately – fast, complete, convincing. What you only realize later: the AI came up with the concept itself.

In practice it looks like this: I ask the assistant to implement a new filter function for an existing list view. The generated code works right away. On closer inspection I notice that the model built its own state management logic – parallel to the one that already exists in the project. It didn't integrate the feature into the existing architecture; it built next to it.

This isn't due to a lack of capability in the model, but to the fact that it had no explicit architecture plan to orient itself by. It improvised the concept – and improvisation in a grown codebase almost inevitably leads to inconsistencies.

2. The Rule Violation Problem: Guidelines Are Ignored

In the post about copilot-instructions.md and AGENTS.md, I described how to give the assistant project-specific instructions. I defined coding guidelines, set architecture rules, and even explicitly wrote down which libraries to use and which to avoid.

That helped – but not reliably.

What's frustrating: I could see in the chat that the agent had read my copilot-instructions.md as context. The file was active, sitting at the top of the window. And yet: a feature for which I had prescribed Pinia for state management was suddenly implemented with local ref state. A component I explicitly wanted configured as a Server Component ended up as a Client Component in the generated code. My guidelines were right in front of its eyes – and were ignored.

Not an error in the strict sense – the code ran. But it ran against my own conventions. And the worst part: it never happened on the first simple call, only when a request became complex. The more attention the model needed for the actual task, the less remained for following my rules.

3. The Duplication Problem: The Wheel Is Reinvented Every Time

This problem cost me the most time and nerves. I ask the AI to build a new function. It builds it – and in doing so, without hesitation, writes three helper functions that already exist in my project. Utility methods for date formatting, a custom error handler class, a bespoke HTTP client setup – all already present, all reimplemented.

The result is code duplication on a grand scale. At some point I had database access implemented in four different ways in my codebase – all functionally correct, all redundant. Cleaning that up was a refactoring effort that stretched over days.

The model simply had no complete overview of what already existed. It could see the code directly provided in context, but not the deeper-lying infrastructure. So it just built its own.
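
A small script can at least make this kind of duplication visible. The following Python sketch uses the standard library's ast module to list function names that are defined in more than one file. It is an illustration under simplifying assumptions: it only compares names, it only understands Python source, and the file paths in the example are hypothetical.

```python
import ast
from collections import defaultdict


def find_duplicate_functions(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each function name defined in more than one file to those files.

    `sources` maps a file path to the Python source code of that file.
    """
    locations = defaultdict(set)
    for path, code in sources.items():
        tree = ast.parse(code)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                locations[node.name].add(path)
    # Keep only names that appear in at least two different files.
    return {
        name: sorted(paths)
        for name, paths in locations.items()
        if len(paths) > 1
    }


# Hypothetical mini-project: the date helper was reimplemented in a feature file.
example_sources = {
    "utils/dates.py": "def format_date(d):\n    return d.isoformat()\n",
    "features/report.py": (
        "def format_date(d):\n    return str(d)\n\n"
        "def build_report():\n    pass\n"
    ),
}

print(find_duplicate_functions(example_sources))
```

Matching by name is crude, since it misses copies with different names and flags legitimate overloads, but even a list like this gives a human or an agent the overview that was missing.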

4. The Focus Problem: Requirements, Concept, and Code All at Once

This is the hardest problem to name, but perhaps the most consequential. When you write an extensive prompt that simultaneously explains the requirements, describes the desired concept, and demands the implementation, the following happens: the model tries to do everything at once.

It analyzes the requirements. It designs a concept. It writes the code. All in a single step.

Sounds efficient. But it isn't. A concrete example from my own work: I wanted to extend an existing authentication feature with an OAuth flow. My prompt contained the functional requirements, a description of the desired flow, and the direct request to write the necessary classes. The result worked – but during the implementation the agent had built a completely new session management structure and simply bypassed my existing one. It had completed all three tasks simultaneously and taken shortcuts in each one that I only discovered days later.

The requirements analysis stays superficial, the concept is never truly validated, and the code contains subtle errors that stem from the previous weaknesses. The more I demand in a single prompt, the less I get from each individual part.

Why These Aren't Bugs, but Architecture

I spent a long time trying to solve these four problems through better prompts, more detailed instruction files, or more context in the window. Without lasting success. Only through engaging with the research literature did it become clear to me: these problems aren't bugs that prompt tuning can fix. They are consequences of the fundamental architecture of language models.

Lost in the Middle: Researchers from Stanford and Berkeley have shown that the performance of LLMs drops significantly when relevant information sits in the middle of a long context rather than at its beginning or end (Liu et al., 2024). When your coding guidelines and existing code fill the context window, rules that end up in the middle are easily "forgotten" as soon as the model has to concentrate on the complex logic at the end.
Context Rot and Redundancy: An analysis of over 33,000 agent-written pull requests on GitHub showed that rejected PRs systematically exhibit larger code changes, more changed files, and more duplication (Ehsani et al., 2026). The study explicitly identifies "duplicate PRs and unwanted feature implementations" as typical symptoms of agent misalignment – exactly what I experienced as the duplication problem.
Hallucinations from Missing Project Context: Researchers have identified three main categories of code hallucinations, with project context conflicts accounting for 24.56% of all errors (Hallucination in LLM-based Code Generation, 2024). On top of that come requirement conflicts (43.53%), where the model generates code that doesn't match what was actually requested. Neither is a careless mistake; both are structural weaknesses that occur when the model has to resolve too many unknowns at once.
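
The positional effect described under "Lost in the Middle" suggests one cheap mitigation: keep the rules out of the middle of the prompt. The Python sketch below assembles a prompt with the rules at the very start and a repeated reminder at the very end. The function name, the section labels, and the example rule are my own assumptions, and this does not guarantee compliance; it only avoids the worst position.

```python
def build_prompt(rules: list[str], context_chunks: list[str], task: str) -> str:
    """Assemble a prompt with rules at the start and a reminder at the end."""
    rules_block = "\n".join(f"- {rule}" for rule in rules)
    sections = [
        "RULES:\n" + rules_block,
        # Bulky material goes in the middle, where a miss hurts the least.
        "CONTEXT:\n" + "\n\n".join(context_chunks),
        "TASK:\n" + task,
        "REMINDER (rules repeated on purpose):\n" + rules_block,
    ]
    return "\n\n".join(sections)


example_prompt = build_prompt(
    rules=["Use Pinia for shared state"],
    context_chunks=["<existing store code>", "<existing component code>"],
    task="Add a filter to the list view.",
)
print(example_prompt)
```

Repeating rules costs tokens, so this only makes sense for a short list of rules that really matter.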

The pattern is the same everywhere: models perform well as long as the context is small, clear, and focused. As soon as tasks grow, contexts get longer, and multiple responsibilities flow together in a single prompt, quality systematically breaks down.

The Path That Revealed Itself

If the problem is systemic, it requires a systemic solution. More prompt text doesn't fix an attention problem – it makes it worse. What truly helps is the opposite: split the context, separate the responsibilities, keep the focus of each step narrow.

In my new workflow I consistently separate three phases that I previously crammed into a single prompt:

Analysis: What needs to be built? What are the use cases?
Concept: How should it be built? What is the technical architecture?
Implementation: Write the code – strictly according to plan.

Each of these phases gets its own specialized agent with a small, precise context. No agent needs to know everything. Each agent does exactly what it does best.
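
As a first impression of what this separation means structurally, here is a minimal Python sketch. It is not my actual setup, which the later posts describe. Each "agent" is just a function from text to text, the three stand-in functions are hypothetical, and the model call that a real agent would make is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable

# An "agent" here is simply a function from input text to output text.
Agent = Callable[[str], str]


@dataclass
class PhaseResult:
    phase: str
    input_text: str
    output_text: str


def run_pipeline(request: str, phases: list[tuple[str, Agent]]) -> list[PhaseResult]:
    """Run the phases in order; each agent sees only the previous phase's output."""
    results = []
    current = request
    for name, agent in phases:
        output = agent(current)
        results.append(PhaseResult(name, current, output))
        current = output  # hand over only the artifact, not the whole history
    return results


# Hypothetical stand-ins for the three specialized agents.
def analyst(request: str) -> str:
    return f"REQUIREMENTS for: {request}"


def architect(requirements: str) -> str:
    return f"PLAN based on: {requirements}"


def implementer(plan: str) -> str:
    return f"CODE following: {plan}"


results = run_pipeline(
    "add OAuth login",
    [("analysis", analyst), ("concept", architect), ("implementation", implementer)],
)
for result in results:
    print(result.phase, "->", result.output_text)
```

The important design choice is the handoff: every phase produces an artifact that can be reviewed before the next phase starts, and the next agent works from that artifact instead of from the full conversation.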

This sounds like more work – and at first it is. But it's the kind of extra effort that saves massive amounts of time in the long run, because code duplication, architecture breaks, and rule violations either never arise in the first place or are dramatically reduced.

How this setup looks in detail, which agents take on which roles, and how the handoff between them works – I'll describe all of that in the following posts of this series.

Conclusion

Vibe coding is not a mistake. It's a powerful tool – for the right task in the right phase. For quick experiments, isolated functions, or prototypes, it's hard to beat.

But for what I do every day – developing complex features in a grown codebase with clear architecture requirements – it falls short. The four problems I described are not bugs in the system. They are properties of a system that, at this point in time, has a limited context window and cannot cleanly handle large task descriptions in a single prompt.

The conclusion I drew: I no longer think in prompts, but in processes. And processes need roles – specialized agents that do exactly one thing very well, instead of many things in a mediocre way.

In the next post of this series I'll introduce the first two agents that handle planning in my workflow: the Business Analyst and the Solution Architect.