article makes very good points but it’s pretty disingenuous to frame an argument about coding agents with a chart that terminates in q1 of 2025 lol
oh i think that's just ai LOL
same trend line holds w/ axis extending to today - or at least i would stand by that
Definitely not going to assume that when coding agents only really took off 6 months after your chart ends.
Some of the top engineers in the world stopped writing code as of 4-5 months ago.
eh, doubt it
i know folks like karpathy say that but the majority of karpathy's projects don't involve a ton of horrible codebases
You can stop writing code. We have a bunch of engineers at work who generate practically everything.
The problem is we can't see any productivity gain, any gain for customers, etc.
They are just doing it because it is cool.
I think it will be difficult to measure productivity with coding agents in any sense until the industry starts standardizing around specific agentic engineering techniques.
Vibe-coding/prompting a feature will produce much more slopulence than writing a .md spec and delegating specific pieces of implementation to the agent. But then is that better than writing it by hand?
Surely there’s an optimal equilibrium, but I’m not convinced many engineers have found it yet.
How do I tell Substack I never want to see your content
What is the source of your first graph, the k-shaped productivity curve? Thanks.
Love the analogy, and as someone who has been coding for a very long time now and also runs a startup (nonbios) in the same space as Cursor/Claude Code - I broadly agree.
However I have a different take on two counts:
Firstly, those who build Camrys couldn't do so without AI. But now that they can, they have the opportunity to learn how to build Ferraris. And some of them will push through and earn their place. We see it happening already at nonbios - non-engineers are 'learning' to build through AI, rather than simply delegating.
Secondly, those who build Ferraris would still prefer to use AI to do it. The 'taste' is still the limiting factor as it is the slowest part to get right, but everything around it is better delegated to AI. I do it myself - but your take is hot - it might not have meaningfully increased the speed. However, despite that, building with AI has a lower 'cognitive' cost, as it can take care of the low-level stuff while I focus on the high-level design.
My engineers are hitting 2.5x their previous delivery velocity with agents while still passing all human code reviews. But this took a significant, ground-up restructure of our development process. If you're still running Agile and wasting time being hyper-prescriptive by typing out individual user stories, agentic dev isn't going to do much for you at all.
Is your revenue exploding / accelerating?
Can you tell us how your engineers are choosing what to deliver?
We're a product-first org, so whatever their product leadership (which is me for my team) is asking them for. The acceleration is consistent across product teams though.
What is it about Gen-Z'ers that makes them think that the rules of punctuation, capitalization, etc., shouldn't apply to their long-form writing? Your credibility quickly approaches zero when you make the quite incorrect assumption that you've somehow earned the poetic license to override the readability of your content with your poor stylistic choices.
Do you believe you write good content? Imagine how many other people might also think so if they could make it more than half-way through your articles before giving up.
Please consider this, rather than simply taking it as an attack. I stumbled upon your article because someone thought enough of it to write his own derivative article. So you must have something worth saying. But you're needlessly limiting the people whom you might be saying it to.
Until now, I had to deal with narcissistic developers who created spaghetti code, making them untouchable in the eyes of management, even when the code was underperforming, slow and memory-intensive. Now, with AI, I can do it myself: I can iterate on domain-specific areas with ease, and companies don't depend on developer egos.
Hi Ethan, thank you for writing this article. I was very surprised to learn that "comma.ai’s software subsidiary famously had an alarm that triggered when the codebase exceeded a certain size", but neither Google nor Bing has any search results about this. Could you provide a source for this? Thanks again!
uhhhh i guess it wasn’t as famous as I thought
the guy who told me was an engineer on our team who went to their office - said it was famous, i guess i never double checked them
The lore here is the “tinygrad” repo (https://github.com/tinygrad/tinygrad). It had a pre-commit check of whether the total library was under 1000 lines of code, mostly as a dig at PyTorch being quite verbose. It’s since grown past that threshold, but for the first 2-3 years it was there. You can watch George Hotz’s old livestreams on YouTube building the library from scratch and enforcing the limit on himself and other contributors.
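For anyone curious what a check like that could look like, here is a minimal sketch. It is not tinygrad's actual script: the names `count_code_lines`, `check_budget` and `main` are made up for illustration, and the counting rules (only `*.py` files, blank lines and `#`-comment lines skipped, limit treated as inclusive) are my own assumptions.

```python
from pathlib import Path

MAX_LINES = 1000  # the budget mentioned in the comment above


def count_code_lines(root: Path, pattern: str = "*.py") -> int:
    """Count non-blank, non-comment lines in files matching `pattern` under `root`."""
    total = 0
    for path in sorted(root.rglob(pattern)):
        for line in path.read_text(encoding="utf-8").splitlines():
            stripped = line.strip()
            # Assumption: blank lines and full-line comments do not count.
            if stripped and not stripped.startswith("#"):
                total += 1
    return total


def check_budget(root: Path, max_lines: int = MAX_LINES) -> bool:
    """Return True when the tree fits the budget (inclusive limit is an assumption)."""
    return count_code_lines(root) <= max_lines


def main(argv: list[str]) -> int:
    """Return 0 when within budget and 1 otherwise, so a hook can block the commit."""
    root = Path(argv[0]) if argv else Path(".")
    total = count_code_lines(root)
    print(f"{total} lines (budget {MAX_LINES})")
    return 0 if total <= MAX_LINES else 1
```

To use it as a hook you would call `sys.exit(main(sys.argv[1:]))` from a small script that your pre-commit setup runs; a non-zero exit code is what stops the commit.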
the empirical evidence used to be with you, but now goes against this claim! cf. metr’s update to their open source dev productivity study. an interesting point nonetheless. catching up to the frontier has always been easier than pushing the frontier, and in the world of code, coding agents have made that even easier. I do think pushing the frontier has become easier, too, through enabling faster iteration, but much less so than it has accelerated catching up.
I appreciate an article that is trying to be a bit balanced vs hype or anti-hype, but a few comments:
It feels a little early to draw conclusions on the self-reinforcing improvement of harnesses. I agree with Karpathy that it seemed like around December that they became a lot more useful and reliable, just in terms of the models and state of the harnesses. Then the processes on top of that are also still evolving, in terms of how quickly things can go safely.
I agree that taste is important vs just shoveling out new features. I’m just not sure that we can draw a conclusion that Anthropic should have won by now just because they were first. Claude Code was just a side project at first and a lot of time has been spent figuring out what works and what doesn’t. I think we’ll have a much better idea by the end of this year.
Another thing that I think is important to consider for anyone using these tools is that AI doesn’t just have to be used to make new features fast. It can also be very helpful with refactorings. The kind of thing that is too complicated for regex and might end up getting pushed back over and over because it doesn’t feel like there’s time to do it manually or create custom parsing/manipulation scripts.
I think of how refactoring tools in IDEs (that rely on types in languages like C# or TypeScript) helped out in the past. You were more likely to rename things to slightly better names when you could be reasonably sure it wouldn’t break anything. Similar for extract method that knows what the params need to be already. The ergonomics helped.
You can use AI to help improve code quality if you use it explicitly for that. There are certain limitations (Karpathy has talked about limits of conciseness), but you can often go pretty far.
I think it's safe to say that we need more manual QA in today’s world. or at least a focus on SDET
Many good points, but we've known since the Mythical Man-Month that adding more engineers does not always speed up large projects, and it's possible AI agents just have many of the same limitations. A different, more interesting measure would be if newer AI-integrated teams can ship at the same rate and quality but with much smaller teams.
"— the problem isn’t just financial, it’s conceptual"
Lol what's going on here
The Camry vs Ferrari analogy is perfect.
AI helps you build faster, but not *better* at the top end.
Taste still wins.
Interesting approach to this shift here: https://shorturl.at/cvsda
cool article! putting some numbers on the feeling I've had for a while