How Do I Measure the Positive Effect of Agents? My Metrics for Agent Performance

Which metrics can be used to measure the positive effect of AI agents in the development process? My approaches to agent performance measurement.

07/06/2026

13 min read

Note on the Content

This post series describes my insights from using AI in the software development process. These are personal experiences and findings from which I derived measures that worked for my projects. This series does not claim to be a comprehensive guide or universally applicable, but rather an inspiration for your own projects.

In the previous posts of this series I described why I left vibe coding behind for complex projects (post 11), how planning and implementation agents work together (post 12 and post 13), what a concrete run looks like (post 15), and how I create the agents in Claude Code and Copilot (post 16) and extend them via MCPs (post 17).

One question I have consistently sidestepped along the way – even though it practically forces itself on you after each of those posts: How do I actually know that all this setup is genuinely better than a simple prompt?

In this post I try to answer that honestly – and present which framework helps shift the question from "I feel more productive" to "I can prove it."

The Problem with Gut Feeling

After several months with the agent workflow, I have a clear impression: it works. Fewer architecture breaks, less duplication, fewer days spent untangling AI-generated spaghetti. If anyone asked me, I would say without hesitation: yes, the effort pays off.

The only problem with this answer: it is exactly what I'm critical of in others – a subjective feeling. And research warns very explicitly against relying on such impressions. A widely cited randomized study by METR showed that experienced open-source developers believed they had become 20% faster after using AI assistants – in reality they were 19% slower. The discrepancy between perceived and measured productivity is not small, but systematic.

Figure: Perception vs. measurement: developers believed they were faster – the stopwatch said the opposite.

Equally problematic are activity metrics: number of commits, lines of accepted suggestions, token consumption. They measure how much happens – not whether the right thing happens. An agent workflow that generates three times as many suggestions as a direct prompt looks more productive in this logic – even if it ultimately produced the same solution in three times the time.

What Research Says About Real Measurement

Anyone who wants to take this topic seriously cannot avoid a look at the recent research literature. Three works were particularly formative for my own understanding.

The study "Measuring AI's True Impact on Developer Productivity" (2025) examines how acceptance rates for AI suggestions evolve over months. Perhaps the most striking finding: junior developers achieve significantly higher percentage productivity gains than seniors. Anyone rolling this out to a team therefore cannot rely on a blanket effect – the effects are unevenly distributed and fluctuate over time. As a standalone metric, acceptance rates are hardly meaningful, because they say nothing about whether accepted suggestions stay in the code in the long run.

The paper "The State of Generative AI in Software Development" (2026) analyzes the entire software development life cycle and arrives at a number that particularly caught my attention: pure code generation covers only about 10–15% of development time. Put differently: even if an AI tool completely replaced typing, the majority of software engineering would remain untouched – architecture, understanding, review, debugging, communication. Anyone measuring AI impact only through code output measures the wrong slice of the pie. The authors also explicitly warn of skill erosion and technical debt if this majority is not also deliberately shaped.
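The ceiling implied by the 10–15% figure can be made concrete with a little arithmetic. The sketch below uses Amdahl's law, which is my framing and not something the paper itself applies, and it assumes that the share refers to total development time and that all other activities stay exactly as slow as before. The function names are made up for illustration.

```python
def overall_speedup(affected_share: float, local_speedup: float) -> float:
    """Amdahl's law: overall speedup when only a share of the work gets faster."""
    if not 0.0 <= affected_share < 1.0:
        raise ValueError("affected_share must be in [0, 1)")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - affected_share) + affected_share / local_speedup)


def upper_bound(affected_share: float) -> float:
    """Limit of the overall speedup if the affected share took no time at all."""
    if not 0.0 <= affected_share < 1.0:
        raise ValueError("affected_share must be in [0, 1)")
    return 1.0 / (1.0 - affected_share)


# Assumed shares: the lower and upper end of the 10-15% range quoted above.
for share in (0.10, 0.15):
    print(
        f"code generation = {share:.0%} of time: "
        f"2x faster typing gives {overall_speedup(share, 2.0):.2f}x overall, "
        f"ceiling {upper_bound(share):.2f}x"
    )
```

Under these assumptions, even typing that costs no time at all would make the whole process less than 1.2 times faster, which is why the remaining 85–90% deserves its own measurement.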

The empirical synthesis "AI-Assisted Software Engineering Productivity" (2026) draws a clear conclusion from this: AI impact is not a prompt topic, but a systems problem. Clean measurement only works when you combine three data sources – telemetry from the tools, pull request data from version control, and structured developer surveys. Each single source alone delivers a distorted picture.

This matches my own experience: token statistics from Claude Code tell me what happened – but not whether it was useful. Commit logs show results, but not the effort I put into reviews afterwards. My own gut feeling is consistent, but as shown above: not reliable. Only the triangulation of these three perspectives produces a picture I would trust.

The DX AI Measurement Framework as a Through Line

This triangulation already exists as a worked-out framework. The DX AI Measurement Framework was developed by DX in collaboration with researchers from GitHub, Atlassian, and Booking.com, and has established itself in industry as a pragmatic standard. It divides AI impact measurement into three dimensions – and precisely this breakdown helps me make the question "how do I measure this?" tangible in the first place.

Figure: The DX framework at a glance – three dimensions that together produce a robust picture.

Dimension 1: Utilization – Is the Tool Even Being Used?

The first dimension sounds trivial but isn't. It measures whether the AI tools are actively in use at all – through activity rates (who uses what, how often) and acceptance rates (what share of suggestions is actually adopted). On teams this is an important indicator for detecting "shadow IT" and unused licenses.

For my solo setup, activity rate is practically irrelevant – I'm the only user, and I can read my activity off my account balance. Acceptance rate, however, is also informative in a solo setting: how often do I adopt an agent output as it comes? How often do I have to course-correct at a HITL (human-in-the-loop) checkpoint? If I have to correct three out of four use cases, that points not to an AI problem but to an agent definition problem – a hint I get directly from the raw acceptance rate.
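How such an acceptance rate per agent could be computed is sketched below. The log format, the agent names as identifiers, and the sample entries are all hypothetical; the only assumption is that each checkpoint gets noted as either accepted unchanged or corrected.

```python
from collections import Counter

# Hypothetical log: one entry per HITL checkpoint, recording the agent and
# whether its output was accepted as delivered or had to be corrected.
checkpoint_log = [
    ("business_analyst", "accepted"),
    ("business_analyst", "corrected"),
    ("business_analyst", "corrected"),
    ("business_analyst", "corrected"),
    ("solution_architect", "accepted"),
    ("solution_architect", "accepted"),
    ("solution_architect", "corrected"),
]


def acceptance_rates(log):
    """Return the share of outputs accepted unchanged, per agent."""
    totals = Counter()
    accepted = Counter()
    for agent, outcome in log:
        totals[agent] += 1
        if outcome == "accepted":
            accepted[agent] += 1
    return {agent: accepted[agent] / totals[agent] for agent in totals}


rates = acceptance_rates(checkpoint_log)
for agent, rate in rates.items():
    print(f"{agent}: {rate:.0%} accepted unchanged")
```

In this made-up log the first agent is corrected three times out of four, which is exactly the pattern that would send me back to its definition rather than to the model.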

Dimension 2: Impact – The Real Movement in the System

The second dimension is the most interesting – and the hardest to measure. Here the framework combines two established models: DORA for technical delivery performance and SPACE for the human side of development work.

DORA: The Operational Delivery Chain

DORA stands for DevOps Research and Assessment, a multi-year research program that is now part of Google Cloud. The four DORA metrics are the industry standard for operational excellence and focus purely on the technical performance of the software delivery chain: how stably and how fast does it run?

| Metric | Meaning |
| --- | --- |
| Deployment Frequency | How often is code successfully promoted to production? |
| Lead Time for Changes | How long does it take from the first commit to production code? |
| Time to Restore Service (MTTR) | How long does it take to fix an error in production? |
| Change Failure Rate | What percentage of deployments lead to errors or rollbacks? |

In practice, two additional derived quantities have a central role in the DX framework: the Cycle Time (from work begun to merge) and the Pull Request Review Time (PR Review Time, the time a code change spends in review before being merged).
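To show how little data the four metrics actually need, here is a minimal sketch that derives them from a list of deployment records. The `Deployment` record, its field names, and the sample timestamps are assumptions for illustration and not the output format of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    first_commit_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # only set for failed deployments


def hours(delta):
    """Convert a timedelta to hours as a float."""
    return delta.total_seconds() / 3600


def dora_summary(deployments, period_days):
    """Compute the four DORA metrics for an observation period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [hours(d.deployed_at - d.first_commit_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        hours(d.restored_at - d.deployed_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployments_per_week": len(deployments) / (period_days / 7),
        "median_lead_time_h": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore_h": median(restore_times) if restore_times else None,
    }


# Hypothetical two-week sample with one failed deployment.
sample = [
    Deployment(datetime(2026, 1, 1, 9), datetime(2026, 1, 1, 17)),
    Deployment(datetime(2026, 1, 3, 10), datetime(2026, 1, 4, 10),
               failed=True, restored_at=datetime(2026, 1, 4, 12)),
    Deployment(datetime(2026, 1, 8, 9), datetime(2026, 1, 8, 13)),
    Deployment(datetime(2026, 1, 10, 9), datetime(2026, 1, 10, 21)),
]
summary = dora_summary(sample, period_days=14)
print(summary)
```

The median is used here instead of the mean so that a single outlier deployment does not dominate the picture; that is a design choice of this sketch, not a requirement of DORA.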

SPACE: The Holistic Counterweight

SPACE was developed by researchers at GitHub, Microsoft, and the University of Victoria – as a conscious counterweight to purely quantitative models. The basic idea: productivity is not just lines of code, but multidimensional. SPACE is intended to help avoid "developer burnout" and surface qualitative factors as well.

| Letter | Dimension | What is measured? |
| --- | --- | --- |
| S | Satisfaction & Well-being | How satisfied are the developers? Are there signs of burnout? |
| P | Performance | The outcome of work – quality, reliability, impact. |
| A | Activity | Number of actions (commits, pull requests, documentation). |
| C | Communication & Collaboration | How good is the flow of information and documentation in the team? |
| E | Efficiency & Flow | How often are developers interrupted? Keyword: cognitive load. |

Why Both Together Are Needed for an Honest Picture

The decisive insight from the framework: anyone looking only at a single DORA metric walks into a trap. If code creation becomes faster but PR review time rises simultaneously, that is not a productivity gain – it is a shift in effort. "I write faster" becomes "my colleague reviews longer." Only when both metrics are considered together can you say anything about real usefulness. Equally important: defect density, that is, how many bugs make it into production despite faster delivery.

Figure: The DORA trap: code is written faster, in exchange review takes longer – net effect close to zero.

In the chart: the falling line is Cycle Time, the rising one is PR review time. Anyone looking only at the first curve cheers – anyone seeing both together recognizes the shift.
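The shift can also be checked with plain numbers. The following sketch splits each pull request into time spent writing the code and time spent in review, and compares the totals. All timings are invented, and treating the sum of both phases as the time from work begun to merge is a simplifying assumption.

```python
from statistics import mean

# Hypothetical per-PR timings in hours: (time until the PR was opened, time in review)
before_ai = [(6.0, 2.0), (8.0, 3.0), (7.0, 2.5)]
with_ai = [(3.0, 5.0), (4.0, 5.5), (3.5, 6.0)]


def summarize(prs):
    """Average coding time, review time, and their sum for a list of PRs."""
    coding = mean(c for c, _ in prs)
    review = mean(r for _, r in prs)
    return {"coding_h": coding, "review_h": review, "total_h": coding + review}


def relative_change(old, new):
    """Change relative to the old value, e.g. -0.5 means 50% less."""
    return (new - old) / old


old, new = summarize(before_ai), summarize(with_ai)
for key in ("coding_h", "review_h", "total_h"):
    print(f"{key}: {old[key]:.1f} -> {new[key]:.1f} "
          f"({relative_change(old[key], new[key]):+.0%})")
```

With these made-up values coding time halves while review time more than doubles, and the total barely moves. Looking at the first row alone would suggest a success that the last row does not confirm.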

SPACE adds a dimension that pure system data doesn't deliver: cognitive load. Has the AI actually taken load off the developer – such as boilerplate, standard tests, repetitive refactorings? Or has it only shifted the kind of load, so that typing turns into continuous review? I already described exactly this phenomenon, the "Inverted Loop," in the post about the strategists. With its cognitive load surveys, the DX framework provides the tool that makes it visible in the first place.

Where My Workflow Pays Into These Models

When I hold my agent setup up against DORA and SPACE, it becomes tangible what it actually does – and what it doesn't:

  • DORA view: The Developer Agent from post 13 and the walkthrough from post 15 aim at cycle time. A vague wish turns into a feature in less than an hour, because decomposition into atomic steps massively shortens the implementation phase – without increasing the change failure rate, because each step leaves a runnable build.
  • SPACE view: The Solution Architect from post 12 primarily improves Efficiency & Flow by lowering cognitive load. The heavy thinking – checking architecture, finding existing patterns, keeping layers clean – shifts from my head into a structured solution design that I only need to review.

This is the point I find most important about this workflow – and that would be lost in a purely DORA-driven measurement: it is not only about producing code faster. It is about making the system as a whole healthier.

Dimension 3: Cost – Tokens, Licenses, and Real Hours

The third dimension is the most sober, but for me solo the most tangible. AI tools cost money – whether as a license, as API consumption, or as token volume in a multi-stage agent workflow. The DX framework recommends weighing these costs against the real time savings from dimension 2 to compute an actual return on investment.

In practice this is not trivial. My workflow with Business Analyst, Solution Architect, Implementation Plan Agent, and Developer Agent consumes noticeably more tokens than a direct prompt – that is the flip side of the decomposition I described in the post about the doers. I have to weigh these costs against the hours I no longer spend on refactorings after architecture breaks. In a solo project this is a mental calculation. On a team it would be a table with clear numbers – and honestly: I believe many setups would not look as good in that table as their advocates assume.
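The mental calculation could look roughly like the sketch below. Every number in it is an assumption chosen for illustration (hours saved, hourly rate, license and token costs), and the formula is a plain net-benefit-over-cost ratio, not a calculation prescribed by the DX framework.

```python
def monthly_roi(hours_saved, hourly_rate, license_cost, token_cost):
    """Return (net_benefit, roi) for one month; roi is net benefit divided by cost."""
    cost = license_cost + token_cost
    if cost <= 0:
        raise ValueError("total cost must be positive")
    net_benefit = hours_saved * hourly_rate - cost
    return net_benefit, net_benefit / cost


def break_even_hours(hourly_rate, license_cost, token_cost):
    """Hours that must be saved per month before the tooling pays for itself."""
    return (license_cost + token_cost) / hourly_rate


# Hypothetical month: 6 hours of avoided refactoring at a rate of 80 per hour,
# a license for 20 and token consumption of 100 for the multi-stage workflow.
net, roi = monthly_roi(hours_saved=6, hourly_rate=80.0,
                       license_cost=20.0, token_cost=100.0)
print(f"net benefit: {net:.0f}, ROI: {roi:.1f}x")
print(f"break-even: {break_even_hours(80.0, 20.0, 100.0):.1f} hours per month")
```

The break-even figure is often the more useful one in practice, because it turns the question into something I can judge from memory: did the workflow really save me more than that many hours this month?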

An Honest Point: What I Actually Measure – and What I Don't

I don't want to pretend that I have fully implemented the DX framework for my single-dev projects. That would be overkill for a solo undertaking, and it would also be dishonest.

What I do pragmatically collect are three simple proxies that help me recognize trends – without claiming scientific rigor:

  1. Correction time per checkpoint: how many minutes do I need at the HITL points in phases 1–3 to review and possibly correct the use case, concept, and plan? This is the most direct indicator of how well my agents are tuned. If the time falls over several features, I'm either learning to prompt better – or the agent definitions are growing with me. Both are good.
  2. Number of rework commits after "finished" features: how often did I have to fix something up after a completed feature because it didn't work in production? This is my unofficial defect density. This number is currently noticeably lower than in the time before the agent workflow – but I haven't collected it systematically, only observed it subjectively.
  3. Subjective absence of frustration: not scientific, but relevant to me. Do I come out of an evening coding session with the feeling of having moved something forward – or with the feeling of having fought symptoms? This is my personal proxy for the SPACE dimension of satisfaction.

Figure: My three pragmatic proxies and which DX dimension each maps to.
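Collecting the three proxies systematically would not take much. The sketch below keeps one record per feature and compares the first and the last few features; the `FeatureRecord` structure, the window size, and all sample values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class FeatureRecord:
    name: str
    correction_minutes: float  # proxy 1: time spent at the HITL checkpoints
    rework_commits: int        # proxy 2: fixes after the feature was "finished"
    frustration_free: bool     # proxy 3: subjective verdict after the session


def trend(values, window=3):
    """Mean of the last `window` values minus the mean of the first `window`.

    Returns None when there are too few values for two separate windows.
    """
    if len(values) < 2 * window:
        return None
    return mean(values[-window:]) - mean(values[:window])


# Hypothetical history of six features, oldest first.
history = [
    FeatureRecord("login", 42, 3, False),
    FeatureRecord("export", 36, 2, True),
    FeatureRecord("search", 36, 1, False),
    FeatureRecord("billing", 24, 1, True),
    FeatureRecord("profile", 21, 0, True),
    FeatureRecord("reports", 21, 2, True),
]

minutes_trend = trend([r.correction_minutes for r in history])
rework_trend = trend([r.rework_commits for r in history])
calm_share = sum(r.frustration_free for r in history) / len(history)

print(f"correction time trend: {minutes_trend:+} min")
print(f"rework commit trend: {rework_trend:+}")
print(f"frustration-free sessions: {calm_share:.0%}")
```

A negative trend means the later features needed less correction or rework than the earlier ones. With so few data points this only indicates a direction and does not prove anything.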

What is deliberately missing: controlled comparison runs. I have never built the same feature once with the agent workflow and once with a direct prompt to compare the times and results directly. That would be the clean way – but realistically that's not going to happen in a solo project, because the effort for the second implementation outweighs the insight I would expect to gain.

That's precisely why the scientific studies from the beginning of this post are so valuable: they replace my missing A/B testing with reliable data from controlled comparisons that others did for me.

What This Would Mean for a Team

In a solo project I can get by well with the three pragmatic proxies. On a team, systematic measurement would be non-negotiable – not because I'd be distrustful, but because the effects multiply and at the same time become blurrier.

A team needs at least three additional things: PR tagging, to make traceable which code sections were AI-supported; clear governance for the agents, covering which ones are used, how they are defined, and which rules apply; and above all the HITL principle as a binding rule, as I described it in the post about the strategists. Without these three prerequisites, impact cannot be measured – because you don't know what you're measuring.
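What PR tagging makes possible can be sketched in a few lines: once AI-supported pull requests carry a label, the two groups can be compared directly. The label name `ai-assisted`, the record fields, and the sample data are assumptions; in a real setup the records would come from the version control platform.

```python
from statistics import median

# Hypothetical pull request records with labels and two outcome measures.
pull_requests = [
    {"id": 1, "labels": ["ai-assisted"], "review_hours": 5.0, "rework_commits": 1},
    {"id": 2, "labels": [], "review_hours": 2.0, "rework_commits": 1},
    {"id": 3, "labels": ["ai-assisted", "bugfix"], "review_hours": 4.0, "rework_commits": 0},
    {"id": 4, "labels": ["bugfix"], "review_hours": 3.0, "rework_commits": 1},
    {"id": 5, "labels": ["ai-assisted"], "review_hours": 6.0, "rework_commits": 2},
]


def split_by_label(prs, label="ai-assisted"):
    """Separate PRs that carry the label from those that do not."""
    tagged = [pr for pr in prs if label in pr["labels"]]
    untagged = [pr for pr in prs if label not in pr["labels"]]
    return tagged, untagged


def summarize_group(prs):
    """Median review time and average rework commits for a group of PRs."""
    if not prs:
        return {"count": 0, "median_review_hours": None, "rework_per_pr": None}
    return {
        "count": len(prs),
        "median_review_hours": median(pr["review_hours"] for pr in prs),
        "rework_per_pr": sum(pr["rework_commits"] for pr in prs) / len(prs),
    }


tagged, untagged = split_by_label(pull_requests)
tagged_summary = summarize_group(tagged)
untagged_summary = summarize_group(untagged)
print("ai-assisted:", tagged_summary)
print("other:      ", untagged_summary)
```

Such a comparison is only as good as the tagging discipline behind it, which is why governance and the HITL rule belong to the same list of prerequisites.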

Conclusion

Anyone taking AI agents seriously must take their impact measurement seriously. Otherwise it remains gut feeling in both directions – enthusiasm and skepticism alike.

The DX AI Measurement Framework is not an end in itself for this, but a language for talking precisely about AI impact. The three dimensions Utilization, Impact, and Cost make visible what otherwise drowns in gut feeling: that speed without quality is not productivity, that acceptance rates without defect density say little, and that costs without time savings are just expenses.

For my solo projects the bar is pragmatic – three simple proxies and an honest look at every coding session. On a team, systematic measurement would be non-negotiable. Both have their place. What matters to me in the end: I want to stop confusing "I feel more productive" with "it works."