Vibe Metrics

I believe that this design is better – no need to A/B test. I think that users will like this feature – we can just roll it out. I know what users will respond to, so we can just use this ad copy. In the old days we trusted our gut – after all, we were experienced experts and knew what we were doing – and ended up making really bad decisions.

But any modern marketer, designer, product manager, or engineer would laugh you out of the room if you tried this today. They might argue about which metric to measure, what counts as an acceptable p-value, or which methodology to use, but we’ve moved beyond the days of superstition and blind belief and into the age of measurable, reproducible results.

[The CEO of Caesar’s Entertainment Corporation] likes to say there are three things that can get you fired from Caesars: Stealing, sexual harassment, and running an experiment without a control group. – Planet Money

All of which is to say that you really have to appreciate the sheer power of the AI industry’s marketing message. They’ve managed to convince you that the tradeoff is “productivity vs. quality” without providing any kind of metric to demonstrate the productivity increases. The quality is increasing, they say, so using AI to juice productivity becomes more and more of a no-brainer.

And let’s be fair – your engineers are telling you that they’re getting more done with AI. If you’re a business leader, this is great news! You trust your engineers, and they’re agreeing with the message you’re hearing from AI companies and industry news sources – case closed. Another short blog post from yours truly.

Except that isn’t how metrics work, is it? If you don’t measure what you’re doing, then it isn’t science – it’s vibe metrics. Productivity is notoriously difficult to measure, so I guess we’ll just have to trust the engineers using the tools. After all, engineers are famously accurate when estimating project size, and always complete work within the expected amount of time. They know how long it takes to do things. If they say that they’re getting more done, they must be right.

As it turns out, there have been two studies that tried to measure engineering productivity with AI under rigorously controlled conditions. The first, from METR, used real-world projects, with each task taking approximately two hours to complete. Here’s what the study found:

Methodology

To directly measure the real-world impact of AI tools on software development, we recruited 16 experienced developers from large open-source repositories (averaging 22k+ stars and 1M+ lines of code) that they’ve contributed to for multiple years. Developers provide lists of real issues (246 total) that would be valuable to the repository—bug fixes, features, and refactors that would normally be part of their regular work. Then, we randomly assign each issue to either allow or disallow use of AI while working on the issue. When AI is allowed, developers can use any tools they choose (primarily Cursor Pro with Claude 3.5/3.7 Sonnet—frontier models at the time of the study); when disallowed, they work without generative AI assistance. Developers complete these tasks (which average two hours each) while recording their screens, then self-report the total implementation time they needed. We pay developers $150/hr as compensation for their participation in the study.

Core Result

When developers are allowed to use AI tools, they take 19% longer to complete issues—a significant slowdown that goes against developer beliefs and expert forecasts. This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.

Wait, what? Not only did the tasks take 19% longer when the engineers were using AI, but these experienced engineers thought that it had made them 20% faster? That’s… concerning.

Luckily for us, we have another study, this one conveniently published by Anthropic. I know, I know, we aren’t supposed to believe what a clearly interested party has to say on the topic, but surely this one will give us a better result?

Motivated by the salient setting of AI and software skills, we design a coding task and evaluation around a relatively new asynchronous Python library and conduct randomized experiments to understand the impact of AI assistance on task completion time and skill development. We find that using AI assistance to complete tasks that involve this new library resulted in a reduction in the evaluation score by 17% or two grade points (Cohen’s d = 0.738, p = 0.010). Meanwhile, we did not find a statistically significant acceleration in completion time with AI assistance.
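A quick aside for anyone who doesn’t read effect sizes every day: Cohen’s d is the standardized gap between the two groups’ mean scores – assuming the usual pooled-standard-deviation definition, it looks like this:

$$ d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}} $$

Here the two groups are the AI-assisted and control participants, and the reported 0.738 is the size of that gap. By Cohen’s conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), 0.738 sits between “medium” and “large.” This is not a rounding error.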

To summarize: using AI hurt the engineers’ learning of the new library (a 17% drop in evaluation scores, p = 0.01) and produced no statistically significant improvement in task completion time (p = 0.391). The tasks were specifically crafted so that the AI could “produce the full, correct code for both tasks directly when prompted”, and there was still no statistically significant improvement in completion time.

Vibe metrics. As long as you believe, you don’t need proof.

But let’s dig a little bit deeper. As I mentioned above, measuring productivity is hard. What is productivity, anyway? Is it the amount of work completed over a particular period of time? While that’s probably true, it also isn’t a useful definition. Instead, let’s ask a more meaningful question: how long does it take to complete a single project?

First, let’s stipulate that we’re talking about engineering time, not calendar time. Obviously, there’s the time spent writing the code, making the commits, creating the pull request, and merging the code. And, indeed, if asked, most junior engineers would probably say that the time from start to merge is how long it took to complete the project.

But there’s also the time other engineers spend helping them. If it takes another engineer an hour to bring them up to speed, and another hour to review the code, that’s two more hours of engineering effort. If a bad PR takes hours or days of back and forth, that gets added to the ledger too. Writing code quickly feels like a productivity increase, but if a code reviewer has to spend extra time on a PR because the code is bad, you’re just engaging in productivity arbitrage.

There’s also QA, including building automated tests and any manual testing. There’s any additional work needed for deployment. If there are production bugs, they have to be identified, tracked, triaged, and fixed. Production outages cost massive amounts of time for many people. If the code eventually has to be refactored because of a bad design or poor-quality code, that gets added too. All of this goes into the full cost of a project. A significant share of that cost comes after the initial release, in long-term maintenance, but we don’t know how to measure it, so we don’t try.
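To make that accounting concrete, here’s a toy tally. Every number below is invented purely for illustration – the point is the shape of the ledger, not the specific hours:

```python
# Toy example: made-up hours for a single hypothetical project.
# The point: "writing the code" is only one line in a much longer ledger.

project_hours = {
    "writing code, commits, opening the PR": 8,
    "another engineer bringing the author up to speed": 1,
    "code review and PR back-and-forth": 3,
    "QA: automated tests and manual testing": 4,
    "deployment work": 1,
    "production bugs: identify, track, triage, fix": 5,
    "refactoring the design later": 6,
}

total = sum(project_hours.values())
coding = project_hours["writing code, commits, opening the PR"]

print(f"Total engineering hours: {total}")  # 28 in this made-up example
print(f"Hours spent actually writing code: {coding} ({coding / total:.0%} of the total)")
```

Speed up that first line while inflating the others – harder reviews, more bugs, earlier refactors – and you haven’t bought productivity, you’ve just moved the cost somewhere you aren’t measuring.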

The studies described above only measured the simplest part of the process, and even in those well-defined scenarios, AI failed to improve task completion time. Perhaps we should believe that code generated by AI will be easier to review? More maintainable over the long term? That code quality and design will be at least as good as code created by a human? That although your engineers won’t understand the code they write as well (p = 0.01), and won’t be faster writing it (p = 0.391), using AI will be a productivity win over the long term?

On what basis?

The AI industry’s say-so?

Your engineers’ belief?

Vibes?
