Comment by llmslave2
2 days ago
One thing I find really funny is that when AI enthusiasts make claims about agents and their own productivity, it's always entirely anecdotally based on their own subjective experience; but when others make claims to the contrary, suddenly there is some overwhelming burden of proof that has to be met before making any sort of claims regarding the capabilities of AI workflows. So which is it?
A while ago someone posted a claim like that on LinkedIn again. And of course there was the usual herd of LinkedIn sheep who were full of compliments and wows about the claim he was making: a 10x speedup of his daily work.
The difference from the zillion others who did the same is that he attached a link to a live stream where he was going to show his 10x speedup on a real-life problem. Credit to him for doing that! So I decided to go have a look.
What I then saw was him struggling for an hour with some simple extension to his project. He didn't manage to finish in the hour what he was planning to. And when I thought about how much time it would have cost me by hand, I figured it would have taken me just as long.
So I answered him in his LinkedIn thread and asked where the 10x speedup was. What followed was complete denial. It had just been a hiccup. Or he could have done other things in parallel while waiting 30 seconds for the AI to answer. Etc. etc.
I admit I was a sceptic at the start, but I honestly had been hoping that my scepticism would be proven wrong. But no.
I'm going to try and be honest with you because I'm where you were at 3 months ago
I honestly don't think there's anything I can say to convince you, because from my perspective that's a fool's errand, and the reason for that has nothing to do with the kind of person either of us is, but with the kind of work we're doing and what we're trying to accomplish
The value I've personally been getting, and valuing, is that it improves my productivity in the specific areas where its average quality of response as one-shot output is better than what I would do myself, because it is equivalent to me Googling an answer, reading 2 to 20 posts, consolidating that information together, and synthesising an output
And that's not to say that the output is good, that's to say that the cost of trying things as a result is much cheaper
It's still my job to refine, reflect, define and correct the problem, the approach etc
I can say this because it's painfully evident to me when I try and do something in areas where it really is weak and I honestly doubt that the foundation model creators presently know how to improve it
My personal evidence for this is that after several years of tilting at those windmills, I'm successfully creating things that I have, on and off, spent the last decade trying and failing to create. Not because I couldn't do it, but because the cost of change and iteration was so high that after trying a few things and failing, I would invariably simplify the problem because solving it was too expensive. I'm now solving a whole category of those problems. This, for me, is different, and I really feel it, because that sting of persistent failure and dread of trying is absent now
That's my personal perspective on it, sorry it's so anecdotal :)
>The value I've personally been getting, and valuing, is that it improves my productivity in the specific areas where its average quality of response as one-shot output is better than what I would do myself, because it is equivalent to me Googling an answer, reading 2 to 20 posts, consolidating that information together, and synthesising an output
>And that's not to say that the output is good, that's to say that the cost of trying things as a result is much cheaper
But there's a hidden cost here -- by not doing the reading and reasoning out the result, you have learned nothing and your value has not increased. Perhaps you expended a bit less energy producing this output, but you've taken one more step down the road to atrophy.
12 replies →
No, I agree with you, there are areas where AI is helping amazingly. Every now and then it helps me with some issue as well, which would have cost me hours earlier and is now done in minutes. E.g. some framework that I'm not that familiar with, or doing the scaffolding for some unit test.
However this is only a small portion of my daily dev work. For most of my work, AI helps me little or not at all. E.g. adding a new feature to a large codebase: forget it. Debugging some production issue: maybe it helps me a little bit to find some code, but that's about it.
And this is what my post was referring to: not that AI doesn't help at all, but to the crazy claims (10x speedup in daily work) that you see all over social media.
Example for me: I am primarily a web dev today. I needed some Kubernetes stuff set up. Usually that's 4 hours of Google and guess-and-check. Claude did it better in 15 minutes.
Even if all it does is speed up the stuff I suck at, that's plenty. Oh boy, Docker builds - it saves my bacon there too.
6 replies →
[flagged]
2 replies →
you haven't contributed much to GitHub since 2022?
*edit unless your commits are elsewhere?
I think people get into a dopamine hit loop with agents, and are so high on dopamine because it's giving them output that simulates progress, that they don't see the reality of where they are at. It is SO DAMN GOOD AT OUTPUT. Agents love to output; it is very easy to think it's inventing physics.
Obviously my subjective experience
Ironic that I'm going to give another anecdotal experience here, but I've noticed this myself too. I catch myself continuing to prompt after an LLM has failed to solve some problem in a specific way, even though at that point I could probably do it faster by switching to doing it fully myself. Maybe because the LLM output feels like it's 'almost there', or some sunk-cost fallacy.
1 reply →
> I think people get into a dopamine hit loop
I also think that's the case, but I'm open to the idea that there are people that are really really good at this and maybe they are indeed 10x.
My experience is that for SOME tasks LLMs help a lot, but overall nowhere near 10x.
Consistently it's probably.... ~1X.
The difference is I procrastinate a lot and LLMs actually help me not procrastinate BECAUSE of that dopamine kick and I'm confident I will figure it out with an LLM.
I'm sure there are many people who got their to-do projects to a conclusion with the help of LLMs and who, because of procrastination or whatever, would not have had a chance to otherwise.
It doesn't mean they're now rich, because most projects won't make you rich or make you any money regardless of whether you finish them or not
1 reply →
You nailed it - like posting on social media and getting dopamine hits as you get likes and comments. Maybe that's what has got all these vibe coders hooked.
> What I then saw was him struggling for an hour with some simple extension to his project. He didn't manage to finish in the hour what he was planning to. And when I thought about how much time it would have cost me by hand, I figured it would have taken me just as long.
For all who are doing that: what is the experience of coding on a livestream? It is something I have never attempted; the mere idea makes me uncomfortable. A good portion of my coding would be rather cringe, like spending way too long on a stupid copy-paste or sign error that my audience would have noticed right away. On the other hand, sometimes I am really fast because everything is in my head, but then I would probably lose everyone. When I watch live coders, I am impressed by how fluid it looks compared to my own work; maybe there is a rubber-duck effect at work here.
All this to say that I don't know how working solo compares to a livestream. Maybe it is more or less efficient, or maybe it doesn't matter that much once you get used to it.
Have done it, never enough of an audience to be totally humiliated. It's never going to be more efficient.
But as for your cringe issue of the audience noticing things, one could see that as a benefit -- better to have someone say, e.g., "you typed `Normalise` (with an 's') again, C++ is written in U.S. English, don't you know / learn to spell, you slime" upfront than to wait for the compiler to tell you that `Normalise` doesn't exist, maybe?
I suspect livestream coding, like programming competition coding and whiteboard coding for interviews, is a separate skill that's fairly well correlated with being able to solve useful problems, but it is not the same thing. You can be an excellent problem solver without being good at doing so while being watched, under time pressure.
I feel like I've been incredibly productive with AI assisted programming over the past few weeks, but it's hard to know what folks' baselines are. So in the interest of transparency, I pushed it all up to sourcehut and added Co-Authored-By footers to the AI-assisted commits (almost all of them).
Everything is out there to inspect, including the facts that I:
- was going 12-18 hours per day
- stayed up way too late some nights
- churned a lot (+91,034 -39,257 lines)
- made a lot of code (30,637 code lines, 11,072 comment lines, plus 4,997 lines of markdown)
- ended up with (IMO) pretty good quality Ruby (and unknown quality Rust).
This is all just from the first commit to v0.8.0. https://git.sr.ht/~kerrick/ratatui_ruby/tree/v0.8.0
What do you think: is this fast, or am I just as silly as the live-streamer?
P.S. - I had an edge here because it was a green-field project and it was not for my job, so I had complete latitude to make decisions.
I don't really know Ruby, so maybe I'm missing something major, but your commit messages seem extremely verbose yet messy (I can't make heads or tails of them) and I'm seeing language like "deprecated" and a stream of "releases" within a period of hours and it just looks a bit like nonsense.
Don't take "nonsense" negatively, please -- I mean it looks like you were having fun, which is certainly to be encouraged.
1 reply →
There were such people here, too.
Copy-pasting the code would have been faster than their work, and there were several problems with their results. But they were so convinced that their work was quick and flawless that they posted a video recording of it.
Hackernews is dominated by these people
LLM marketers have succeeded at inducing collective delusion
3 replies →
> So I answered him in his LinkedIn thread and asked where the 10x speedup was. What followed was complete denial. It had just been a hiccup. Or he could have done other things in parallel while waiting 30 seconds for the AI to answer. Etc. etc.
So I’ve been playing with LLMs for coding recently, and my experience is that for some things, they are drastically faster. And for some other things, they will just never solve the problem.
Yesterday I had an LLM code up a new feature with comprehensive tests. It wasn’t an extremely complicated feature. It would’ve taken me a day with coding and testing. The LLM did the job in maybe 10 minutes. And then I spent another 45 minutes or so deeply reviewing it, getting it to tweak a few things, update some test comments, etc. So about an hour total. Not quite a 10x speed up, but very significant.
But then I had to integrate this change into another repository to ensure it worked for the real-world use case, and that ended up being a mess, mostly because I am not an expert in the package management and I was trying to subvert it to use an unpublished package. Debugging this took the better part of the day. For this case, the LLM maybe saved me 20%, because it did have a couple of tricks that I didn't know about. But it was certainly not a massive speedup.
So far, I am skeptical that LLMs will make someone 10x as efficient overall. But that's largely because not everything is actually coding. Subverting the package management system to do what I want isn't really coding. Participating in design meetings and writing specs and sending emails and dealing with red tape and approvals is definitely not coding.
But for the actual coding specifically, I wouldn’t be surprised if lots of people are seeing close to 10x for a bunch of their work.
I suspect there's also a good amount of astroturfing happening here as well, making it harder to find the real success stories.
I've noticed a similar trend. There seems to be a lot of babysitting and hand holding involved with vibe-coding. Maybe it can be a game changer for "non-technical founders" stumbling their way through to a product, but if you're capable of writing the code yourself, vibe coding seems like a lot of wasted energy.
Even if this takes a vibe coder two or three hours, it's still cheaper than a real developer
Shopify's CEO just posted the other day that he's super productive using the newest AI models and many of the supportive comments responding to his claim were from CEOs of AI startups.
There's too much money, time, and infrastructure committed for this to be anything but successful.
It's tougher than a space race or the nuclear bomb race, because there are fewer hard tangibles as evidence of success.
I think there is also some FOMO involved. Once people started saying how AI was helping them be more productive, a lot of folks felt that if they didn't do the same, they were lagging behind.
Sounds like someone trying to sell a course or something.
Maybe he would have otherwise struggled for 10 hours on that extension.
10 times zero is still zero!
You're supposed to believe in his burgeoning synergy so that one day you may collaborate to push industry leading solutions
[dead]
It's an impossible thing to disprove. Anything you say can be countered by their "secret workflow" they've figured out. If you're not seeing a huge speedup well you're just using it wrong!
The burden of proof is 100% on anyone claiming the productivity gains
I go to meetups and enjoy myself so much; 80% of people are showing how to install 800000000 MCPs on their 92GB MacBook Pros, new RAG memory, n8n agent flows, super special prompting techniques, secret sauces, killer .md files, special VS Code setups, and after all that they are still no more productive than vanilla Claude Code in a git repo. You get people saying 'look, I only have to ask xyz... and it does it! magic'; then you just type 'do xyz' in vanilla CC and it does exactly the same thing, often faster.
This was always the case. People obsessing over keyboards, window managers, emacs setups... always optimizing around the edges of the problem, but this is all taking an incredible amount of their time versus working on real problems.
14 replies →
That ties in perfectly with my experience. Just direct prompts, with limited setup and limited context, seem to work better than or just as well as complex custom GPTs. There are not just diminishing but inverted returns to complexity in GPTs
2 replies →
No, no, you misunderstand: that's still a massive productivity improvement compared to them being on their own with their own incompetence and refusal to learn how to code properly
1 reply →
This gets comical when there are people, on this site of all places, telling you that using curse words or "screaming" in ALL CAPS in your agents.md file makes the bot follow orders with greater precision. And these people have "engineer" on their resumes...
there's actually quite a bit of research in this field, here are a couple:
"ExpertPrompting: Instructing Large Language Models to be Distinguished Experts"
https://arxiv.org/abs/2305.14688
"Persona is a Double-edged Sword: Mitigating the Negative Impact of Role-playing Prompts in Zero-shot Reasoning Tasks"
https://arxiv.org/abs/2408.08631
7 replies →
I've been trying to stop the coding assistants from making git commits on their own and nothing has been working.
25 replies →
Wasn't Cursor or someone using one of these horrifying prompts? Something about having to do a good job or they won't be paid, and then they won't be able to afford their mother's cancer treatment, and then she'll die?
How is this any different from the Apple "you're holding it wrong" argument? The critical reason that kind of response is so out of touch is that the same people praise Apple for its intuitive nature. How can any reasonable and rational person (especially an engineer!) not see that these two beliefs are in direct opposition?
If "you're holding it wrong" then the tool is not universally intuitive. Sure, there'll always be some idiot trying to use a lightbulb to screw in a nail, but if your nail has threads on it and a notch on the head then it's not the user's fault for picking up a screwdriver rather than a hammer.
What scares me about ML is that many of these people have "research scientist" in their titles. As a researcher myself, I'm constantly stunned at people not understanding something as basic as who has the burden of proof. Fuck off. You're the one saying we made a brain by putting lightning into a rock and shoving tons of data into it. There's so much about that that I'm wildly impressed by. But to call it a brain in the same way you'd call a human brain one requires significant evidence. Extraordinary claims require extraordinary evidence. There's some incredible evidence, but an incredible lack of scrutiny as to whether it isn't really evidence for something else.
>makes the bot follow orders with greater precision.
Gemini will ignore any directions to never reference or use YouTube videos, no matter how many ways you tell it not to. It may remove it if you ask, though.
8 replies →
Yes, using tactics like front-loading important directives, emphasizing extra-important concepts, and flagging things that should be double- or even triple-checked for correctness because of their expected intricacy makes sense for human engineers as well as "AI" agents.
I'd say such hacks don't make you an engineer, but they are definitely part of engineering anything that has to do with LLMs. With too-long system prompts/agents.md files not working well, it definitely makes sense to optimize the existing prompt with minimal additions. And if swear words, screaming, shaming, or tipping work, well, that's the most token-efficient optimization of a brief, well-written prompt.
Also, of course, current agents already have the possibility to run endlessly if they are well instructed; steering them to avoid reward hacking in the long term definitely IS engineering.
Or how about telling them they are working in an orphanage in Yemen and it's struggling for money, but luckily they've got an MIT degree and now they are programming to raise money. But their supervisor is a psychopath who doesn't like their effort and wants children to die, so work has to be done as diligently as possible, and each step has to be viewed through the lens that the supervisor might find an excuse to forbid programming.
Look, as absurd as it sounds, a variant of that scenario works extremely well for me. Just because it's plain language doesn't mean it can't be engineering; at least, I'm of the opinion that it definitely is if it has an impact on which use cases are possible
> cat AGENTS.md
WRITE AMAZING INCREDIBLE VERY GOOD CODE OR ILL EAT YOUR DAD
..yeah I've heard the "threaten it and it'll write better code" one too
3 replies →
Except that is demonstrably true.
Two things can be true at the same time: I get value and a measurable performance boost from LLMs, and their output can be so stupid/stubborn sometimes that I want to throw my computer out the window.
I don't see what is new, programming has always been like this for me.
Works on human subordinates too, kinda, if you don't mind the externalities…
"don't make mistakes" LMAO
There's no secret IMO. It's actually really simple to get good results. You just expect the same things from the LLM you would from a Junior. Use an MD file to force it to:
1) Include good comments in whatever style you prefer, document everything it's doing as it goes and keep the docs up to date, and include configurable logging.
2) Make it write and actually execute unit tests for everything before it's allowed to commit anything, again through the md file.
3) Ensure it learns from its mistakes: Anytime it screws up, tell it to add a rule to its own MD file reminding it never to repeat that mistake. Over time the MD file gets large, but the error rate plummets.
4) This is where it drifts from being treated as a standard Junior. YOU must manually verify that the unit tests are testing for the right thing. I usually add a rule to the MD file telling it not to touch them after I'm happy with them, but even then you must also check that the agent didn't change them the first time it hit a bug. Modern LLMs are now worse at this for some reason. Probably because they're getting smart enough to cheat.
If you do these basic things you'll get good results almost every time.
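For illustration, a stripped-down sketch of what such an MD file might contain. The rule wording here is invented, not taken from the parent; tune to taste:

    # AGENTS.md (hypothetical sketch)

    ## Style & docs
    - Comment every non-obvious block; keep docs/ in sync with any code change.
    - All new modules get configurable logging.

    ## Testing
    - Write AND run unit tests for everything before any commit.
    - Never touch a test file once it has been approved; ask first.

    ## Lessons learned (append one rule after every screw-up)
    - Never "fix" a failing test by weakening its assertions.
    - Never regenerate the DB schema; write a migration instead.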
> This is where it drifts from being treated as a standard Junior. YOU must manually verify that the unit tests are testing for the right thing.
You had better juniors than me. What unit tests? :P
The MD file is a spec sheet, so now you're expecting every warm body to be a Sr. Engineer. But where do you start as a Junior warm body? Reviewing code, writing specs, reviewing implementation details... that's all Sr.-level stuff
It's impossible to prove in either direction. AI benchmarks suck.
Personally, I like using Claude (for the things I'm able to make it do, and not for the things I can't), and I don't really care whether anyone else does.
I'd just like to see a live coding session from one of these 10x AI devs
Like genuinely. I want to get stuff done 10x as fast too
30 replies →
> AI benchmarks suck.
Not only do they suck, but it's essentially an impossible task, since there is no frame of reference for what "good code" looks like.
[dead]
Ah, the "then you are doing it wrong" defence.
Also, you have to learn it right now, because otherwise it will be too late and you will be outdated, even though it is improving very fast allegedly.
TBF, there are lots of tools that work great but most people just can't use.
I personally can't use agentic coding, and I'm reasonably convinced the problem is not with me. But it's not something you can completely dismiss.
> Also, you have to learn it right now, because otherwise it will be too late and you will be outdated, even though it is improving very fast allegedly.
This in general is a really weird behaviour that I come across a lot, I can't really explain it. For example, I use Python quite a lot and really like it. There are plenty of people who don't like Python, and I might disagree with them, but I'm not gonna push them to use it ("or else..."), because why would I care? Meanwhile, I'm often told I MUST start using AI ("or else..."), manual programming is dead, etc... Often by people who aren't exactly saying it kindly, which kind of throws out the "I'm just saying it out of concern for you" argument.
3 replies →
That one's my favorite. You can't defend against it; it just shuts down the conversation. Odds are, you aren't doing it wrong. These people are usually suffering from Dunning-Kruger at best, or they're paid shills/bots at worst.
2 replies →
People say it takes at least 6 months to learn how to use LLMs effectively, while at the same time the field is changing so rapidly, while at the same time agents were useless until Opus 4.5.
Which is it lol.
1 reply →
If you had negative results using anything more than 3 days old, then it's your fault, your results mean nothing because they've improved since then. /s
Many of them are also exercising absurd token limits - like running 10 Claudes at once and leaving them running continuously to "brute force" solutions out. It may be possible, but it's not really an acceptable workflow for serious development.
> but it's not really an acceptable workflow for serious development.
At what cost do you see this as acceptable? For example, how many hours of saved human development time is worth one hour of salary spent on LLM tokens, funded by the developer? And then, what's acceptable if it's funded by the employer?
1 reply →
we get a $1,000/month budget; just about every dev uses it for 5 Claude accounts
We had the fabled 10x engineer long before, and independent of, agentic coding. Some people claim it's real, others claim it's not, with much the same conviction. If something that should be so clear-cut is debatable, why would anyone now be able to produce a convincing, discussion-resolving argument for (or against) agentic coding? We don't even manage that for tabs vs. spaces.
The reason neither can be resolved in a forum like this is that coding output is hard to reason about, for various reasons, and people want it to be hard to reason about.
I would like to encourage people to think that the burden of proof always falls on themselves, to themselves. Managing to not be convinced in an online forum (regardless of topic or where you land on the issue) is not hard.
I just saw nstummbillig shout racist remarks.
[flagged]
They remind me so much of that group of people who insist the scammy magnetic bracelets[1] "balance their molecules" or something making them more efficient/balanced/productive/energetic/whatever. They are also impossible to argue with, because "I feel more X" is damn near impossible to disprove.
[1] https://en.wikipedia.org/wiki/Power_Balance , https://en.wikipedia.org/wiki/Hologram_bracelet , https://en.wikipedia.org/wiki/Ionized_jewelry
> The burden of proof is 100% on anyone claiming the productivity gains
IMHO, I think this is just going to go away. I was up until recently using copilot in my IDE or the chat interface in my browser and I was severely underwhelmed. Gemini kept generating incorrect code which when pasted didn't compile, and the process was just painful and a brake on productivity.
Recently I started using Claude Code cli on their latest opus model. The difference is astounding. I can give you more details on how I am working with this if you like, but for the moment, my main point is that Claude Code cli with access to run the tests, run the apps, edit files, etc has made me pretty excited.
And my opinion has now changed because "this is the worst it will be" and I'm already finding it useful.
I think within 5 years, we won't even be having this discussion. The use of coding agents will be so prolific and obviously beneficial that the debate will just go away.
(all in my humble opinion)
So will all the tech jobs in the US. When it gets that good you can farm it out to some other country for much cheaper.
1 reply →
I mean, a DSL packed full of features, a full LSP, DAP for step debugging, profiling, etc.
https://github.com/williamcotton/webpipe
https://github.com/williamcotton/webpipe-lsp
https://github.com/williamcotton/webpipe-js
Take a look at my GitHub timeline for an idea of how little time this took for a solo dev!
Sure, there’s some tech debt but the overall architecture is pretty extensible and organized. And it’s an experiment. I’m having fun! I made my own language with all the tooling others have! I wrote my own blog in my own language!
One of us, one of us, one of us…
People claiming productivity gains don't have to prove anything to anyone. A few are trying to open others' eyes, but my guess is that will eventually stop. They will be among the few, though, still left doing this SWE work in the near future :)
Responses are always to check your prompts, and ensure you are using frontier models - along with a warning about how you will quickly be made redundant if you don't lift your game.
AI is generally useful, and very useful for certain tasks. It's also not initiating the singularity.
Some fuel for the fire: the last two months mine has become way better, one-shotting tasks frequently. I do spend a lot of time in planning mode to flesh out proper plans. I don't know what others are doing that they are so sceptical, but from my perspective, once I figured it out, it really is a massive productivity boost with minimal quality issues. I work on a brownfield project with about 1M LoC, fairly messy, mostly C# (so strong typing & strict compiler is a massive boon).
My work flow: Planning mode (iterations), execute plan, audit changes & prove to me the code is correct, debug runs + log ingestion to further prove it, human test, human review, commit, deploy. Iterate a couple of times if needed. I typically do around three of these in parallel to not overload my brain. I have done 6 in the past but then it hits me really hard (context switch whiplash) and I start making mistakes and missing things the tool does wrong.
To the ones saying it is not working well for them, why don't you show and tell? I cannot believe our experiences are so fundamentally different, I don't have some secret sauce but it did take a couple of months to figure out how to best manipulate the tool to get what I want out of it. Maybe these people just need to open their minds and let go of the arrogance & resistance to new tools.
> My work flow: Planning mode (iterations), execute plan, audit changes & prove to me the code is correct, debug runs + log ingestion to further prove it, human test, human review, commit, deploy. Iterate a couple of times if needed.
I'm genuinely curious if this is actually more productive than a non-AI workflow, or if it just feels more productive because you're not writing the code.
One reason why it can be more productive is that it can be asynchronous. I can have Claude churning away on something while I do something else on a different branch. Even if the AI takes as long as a human to do the task, we're doing a parallelism that's not possible with just one person.
Go through a file of 15000 lines of complex C# business logic + db code, and search for specific thing X and refactor it, while going up & down the code to make sure it is correct. Typically these kinds of tasks can take anywhere from 1 day to a week for a good human developer, depending on the mess and who wrote it (years ago under different conditions). With my workflow I can get a good analysis of what the code is doing, where to refactor (and which parts to leave alone), where some risks are, find other issues that I didn't even know about before - all within 10 minutes. Then to do my iteration above to fix it (planning & coding) takes about another 30 minutes. So 30 minutes vs 1 week of hair pulling and cursing (previous developers choices..)... And it is not vibe coding, I check every single change in git diff tool long before committing, I understand everything being done and why before I use it.
Next level tool for this: https://github.com/covibes/zeroshot/
Here is a short example from my daily life: a D96A INVOIC EDI message containing multiple invoices, transformed into an Excel file.
I used the ChatGPT web interface for this one-off task.
Input: A D96A INVOIC text message. Here is what those look like, a short example, the one I had was much larger with multiple invoices and tens of thousands of items: https://developer.kramp.com/edi-edifact-d96a-invoic
The result is not code but a transformed file. This exact scenario can easily be turned into code, though, by changing the request from "do this" to "provide a [Python|whatever] script to do this". Internally the AI produces code, runs it, and gives you the result. You actually make it do less work if you just ask for the script and tell it not to run it.
Only what I said. I had to ask for some corrections because it made a few mistakes interpreting the codes.
> (message uploaded as file)
> Analyze this D.96A message
> This message contains more than one invoice, you only parsed the first one
(it finds all 27 now)
> The invoice amount is in segment "MOA+77". See https://www.publikationen.gs1-germany.de/Complete/ae_schuhe/... for a list of MOA codes (German - this is a German company invoice).
> Invoice 19 is a "credit note", code BGM+381. See https://www.gs1.org/sites/default/files/docs/eancom/ean02s4/... for a list of BGM codes, column "Description" in the row under "C002 DOCUMENT/MESSAGE NAME"
> Generate Excel report
> No. Go back and generate a detailed Excel report with all details including the line items, with each invoice in a separate sheet.
> Create a variant: All 27 invoices in one sheet, with an additional column for the invoice or credit note number
> Add a second sheet with a table with summary data for each invoice, including all MOA codes for each invoice as a separate column
The result was an Excel file with an invoice per worksheet, and meta data in an additional sheet.
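To give a sense of what the generated code has to deal with, here is a minimal sketch of my own (an illustration, not the linked pastebin script) that pulls document types and MOA+77 amounts out of such a message. It assumes default EDIFACT separators and ignores UNA headers and the ? release character:

    # Toy D96A INVOIC parser -- an illustration only, not the real script.
    # Assumes default separators (' = segment, + = element, : = component).
    def parse_invoic(text):
        invoices, current = [], None
        for seg in text.replace("\n", "").split("'"):
            parts = seg.strip().split("+")
            if parts[0] == "BGM" and len(parts) > 1:
                # one BGM per invoice message;
                # document code: 380 = commercial invoice, 381 = credit note
                current = {"type": parts[1].split(":")[0],
                           "number": parts[2] if len(parts) > 2 else None}
                invoices.append(current)
            elif parts[0] == "MOA" and len(parts) > 1 and current is not None:
                qualifier, amount = (parts[1].split(":") + [""])[:2]
                if qualifier == "77" and amount:  # MOA+77 = invoice amount
                    current["amount"] = float(amount)
        return invoices

Writing the worksheets from a list of dicts like that is then a few lines of openpyxl or pandas.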
Similarly, by doing what I wrote above, but telling the AI at the start not to do anything itself and instead give me a Python script, plus similar instructions, I got a several-hundred-line Python script that processed my collected DESADV EDI messages in XML format ("Process a folder of DESADV XML files and generate an Excel report.")
If I had had to actually write that code myself, it would have taken me all day and maybe more, mostly because I would have had to research a lot of things first. I'm not exactly parsing various format EDI messages every day after all. For this, I wrote a pretty lengthy and very detailed request though, 44 long lines of text, detailing exactly which items with which path I wanted from the XML, and how to name and type them in the result-Excel.
ChatGPT Query: https://pastebin.com/1uyzgicx
Result (Python script): https://pastebin.com/rTNJ1p0c
> why don't you show and tell?
How do you suggest? At a high level, the biggest problem is the high latency and the context switches. It is easy enough to get the AI to do one thing well. But because it takes so long, the only way to derive any real benefit is to have many agents doing many things at the same time. I have not yet figured out how to effectively switch my attention between them. But I wouldn't have any idea how to turn that into a show and tell.
I don't know how y'all are letting the AIs run off with these long tasks at all.
The couple times I even tried that, the AI produced something that looked OK at first and kinda sorta ran but it quickly became a spaghetti I didn't understand. You have to keep such a short leash on it and carefully review every single line of code and understand thoroughly everything that it did. Why would I want to let that run for hours and then spend hours more debugging it or cleaning it up?
I use AI for small tasks or to finish my half-written code, or to translate code from one language to another, or to brainstorm different ways of approaching a problem when I have some idea but feel there's something better way to do it.
Or I let it take a crack when I have some concrete failing test or build; feeding that into an LLM loop is one of my favorite things, because it can just keep trying until it passes, and even if it comes up with something suboptimal, you at least have something that compiles that you can just tidy up a bit.
Sometimes I'll have two sessions going but they're like 5-10 minute tasks. Long enough that I don't want to twiddle my thumbs for that long but small enough that I can rein it in.
2 replies →
Longest task mine has ever done was 30 minutes. Typically around 10 minutes for complex tasks. Most things take less than 2 minutes (these usually offer the most bang for the buck, as they save me half a day).
As a die-hard old-schooler, I agree. I wasn't particularly impressed by Copilot, though it did do a few interesting tricks.
Aider was something I liked and used quite heavily (with Sonnet). Claude Code has genuinely been useful. I've coded up things which I'm sure I could have done myself if I had the time "on the side", and used them in "production". These were mostly personal tools, but I do use them on a daily basis and they are useful. The last big piece of work was refactoring a 4000-line program, which I wrote piece by piece over several weeks, into something with proper packages and structures. There were one or two hiccups, but I have it working. Took a day and approximately $25.
I have basically the same workflow. Planning mode has been the game changer for me. One thing I always wonder is how do people work in parallel? Do you work in different modules? Or maybe you split it between frontend and backend? Would love to hear your experience.
I plan out N features at a time, then have it create N git worktrees and spawn N subagents. It does a decent job. I find doing proper reviews on each worktree kind of annoying, though, so I tend to pull them in one at a time and do a review, code, test, feedback loop until it’s good, commit it, pull in the next worktree and repeat the process.
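For anyone wondering what that looks like mechanically, here is a rough sketch of scripting the worktree setup (branch and path names invented); each worktree then hosts its own agent session:

    # Rough sketch of the N-worktrees setup described above.
    # Branch/path names are invented for illustration.
    import subprocess

    for branch in ["feat-auth", "feat-search", "feat-export"]:
        subprocess.run(
            ["git", "worktree", "add", f"../wt-{branch}", "-b", branch],
            check=True,
        )

    # Then review and fold each one back in, one at a time:
    #   git -C ../wt-feat-auth diff main
    #   git checkout main && git merge feat-auth
    #   git worktree remove ../wt-feat-auth && git branch -d feat-auth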
I literally have 3 folders, each on their own branch. But lately I use 1 folder a lot and work on different features (ones that won't introduce "merge conflicts", in a sense). Or I do read-only explorations (code auditing is fun!) while another one makes edits on a different feature, and maybe another one does something else in the Flutter app folder. So it's fairly easy to parallelize things like this. The next step is to install the .NET SDK + Claude on some VMs and just trigger workflows from there, so no IDE involved.
You won't be able to parallelize things if you just use the IDEs and their plugins. I do mine in the terminal with extra tabs, outside of the IDE.
This.
If you’re not treating these tools like rockstar junior developers, then you’re “holding it wrong”.
The problem I have with this take is that I'm very skeptical that guiding several junior developers would be more productive than just doing the work myself.
With real junior developers you get the benefit of helping develop them into senior developers, but you really don't get that with AI.
1 reply →
My running joke and justification to our money guy (to pay for expensive tools) is that it's like I have 10 junior devs at my side with infinite knowledge (a domain expert with too much caffeine), no memory or feelings (I can curse at it without convos with HR), who can code decently enough (better than most juniors, actually) and do excellent system admin work... all for a couple hundred dollars a month, which is a bargain!
[flagged]
Ran out of context too soon?
Actually, quite the opposite. It seems any positive comment about AI coding gets at least one response along the lines of "Oh yeah, show me proof" or "Where is the deluge of vibe-coded apps?"
For my part, I point out there are a significant number of studies showing clear productivity boosts in coding, but those threads typically devolve to "How can they prove anything when we don't even know how to measure developer productivity?" (The better studies address this question and tackle it with well-designed statistical methods such as randomized controlled trials.)
Also, there are some pretty large Github repos out there that are mostly vibe-coded. Like, Steve Yegge got to something like 350 thousand LoC in 6 weeks on Beads. I've not looked at it closely, but the commit history is there for anyone to see: https://github.com/steveyegge/beads/commits/main/
That seems like a lot more code than a tool like that should require.
It does, but I have no mental model of what would be required to efficiently coordinate a bunch of independently operating agents, so it's hard to make a judgement.
Also, about half of it seems to be tests. It even has performance benchmarks, which are always a distant afterthought for anything other than infrastructure code in the hottest of loops! Someone wrote here (https://news.ycombinator.com/item?id=45729826) that the logical conclusion of AI coding will look very weird to us, and I guess this is one glimpse of it.
Please provide links to the studies, I am genuinely curious. I have been looking for data but most studies I find showing an uplift are just looking at LOC or PRs, which of course is nonsense.
Meta measured a 6-12% uplift in productivity from adopting agentic coding. That's paltry. A Stanford case study found that, after accounting for buggy code that needed to be reworked, there may be no productivity uplift.
I haven't seen any study showing a genuine uplift after accounting for properly reviewing and fixing the AI generated code.
>Meta measured a 6-12% uplift in productivity from adopting agentic coding. That's paltry.
That feels like the right ballpark. I would have estimated 10-20%. But I'd say that's not paltry at all. If it's a 10% boost, it's worth paying for. Not transformative, but worthwhile.
I compare it to moving from a single monitor to a multi-monitor setup, or getting a dev their preferred IDE.
I mention a few here: https://news.ycombinator.com/item?id=45379452
> ... just looking at LOC or PRs, which of course is nonsense.
That's basically a variation of "How can they prove anything when we don't even know how to measure developer productivity?" ;-)
And the answer is the same: robust statistical methods! For instance, amongst other things they compare the same developers over time doing regular day-job tasks with the same quality control processes (review etc.) in place, before and after being allowed to use AI. It's like an A/B test. Spreading across a large N and time duration accounts for a lot of the day-to-day variation.
Note that they do not claim to measure individual or team productivity, but they do find a large, statistically significant difference in the data. Worth reading the methodologies to assuage any doubts.
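To make the before/after framing concrete, here is a toy version of such a paired comparison. The numbers are invented and this is nowhere near a real study's methodology, just the basic design:

    # Toy paired before/after comparison -- invented numbers, not study data.
    # Real studies control for task mix, seasonality, etc.
    from scipy import stats

    before = [7, 5, 9, 6, 8, 7, 6, 10, 5, 7]   # tasks/week, same devs, no AI
    after  = [8, 6, 10, 7, 9, 9, 6, 12, 6, 8]  # tasks/week, same devs, with AI

    result = stats.ttest_rel(after, before)
    lift = (sum(after) / sum(before) - 1) * 100
    print(f"lift: {lift:.1f}%, t={result.statistic:.2f}, p={result.pvalue:.3f}")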
> A Stanford case study found that, after accounting for buggy code that needed to be reworked, there may be no productivity uplift.
I'm not sure if we're talking about the same Stanford study; the one in the link above (100K engineers across 600+ companies) does account for "code churn" (ostensibly fixing AI bugs) and still finds an overall productivity boost in the 5-30% range. This depends a LOT on the use case (e.g. complex tasks on legacy COBOL codebases actually see a negative impact).
In any case, most of these studies seem to agree on a 15 - 30% boost.
Note these are mostly from the ~2024 timeframe, using the models from then without today's agentic coding harnesses. I would bet the number is much higher these days. More recent reports from sources like DX find up to a 60% increase in throughput, though I haven't looked closely at this and have some doubts.
> Meta measured a 6-12% uplift in productivity from adopting agentic coding. Thats paltry.
Even assuming a lower-end of 6% lift, at Meta SWE salaries that is a LOT of savings.
However, I haven't come across anything from Meta yet, could you link a source?
3 replies →
more code = better software
If the software has tens of thousands of users without expecting to get any at all, does the code even matter?
3 replies →
- This has been going on for well over a year now.
- They always write relatively long, zealous explainers of how productive they are (including some replies to your comment).
These two points together make me think: why do they care so much about convincing me? Why don't they just link me to the amazing thing they made? That would be pretty convincing!
Are they being paid or otherwise incentivised to make these hyperbolic claims? To be fair they don't often look like vanilla LLM output but they do all have the same structure/patter to them.
I think it's a mix of people being actually hyped and wishing this is the future. For me, productivity gains are mostly in areas where I don't have expertise (but the downside, of course, is that I don't learn much if I let AI do the work) or when I know it's a throwaway thing and I absolutely don't care about the quality.

For example, I'm bedtime-reading a series of books to my daughter, and one of them doesn't have a Polish translation, and the Polish publisher stopped working with the author. I vibe coded an app that extracts an epub, translates each of the chapters, and packages it back into an epub, with a few features like saving the translations in sqlite (so the translation can be stopped and resumed), the ability to edit translations, custom instructions, etc. It's only ~1000 lines of Rust code, but Claude generated it while I was making dinner (I just checked progress and prompted next steps every few minutes). I can guarantee it would have taken me at least an evening of coding, probably with debugging problems along the way, to make it work.

So while I know it still falls short in certain scenarios (novel code in a niche technology, very big projects, etc.), it is kind of a game changer in others. It lets me build small tools that I just wouldn't have time to do otherwise.
So I guess what I'm saying is that, even with all the limitations, I kinda understand the hype. That said, I think some people may indeed exaggerate LLMs' capabilities, unless they actually know some secret recipe to make them do all those awesome hyped things (in which case I would love to see it).
Hilariously, the only impressive thing I've ever heard of being made with AI was Yegge's "GasTown", which is a Kubernetes-like orchestrator... for AI agents. And half of it seemed to be a workaround for "the agents keep stopping, so I need another agent to monitor another agent to monitor another agent to keep them on task".
> why do they care so much to convince me;
Someone might share something for a specific audience which doesn't include you. Not everything shared is required to be persuasive. Take it or leave it.
> why don't they just link me to the amazing thing they made, that would be pretty convincing?!
99.99% of the things I've created professionally don't belong to me and I have no desire or incentives to create or deal with owning open source projects on my own time. Honestly, most things I've done with AI aren't amazing either, it's usually boring routine tasking, they're just done more cost efficiently.
If you flip the script, it's just as damning. "Hey, here are some general approaches that are working well for me, check it out" has been countered by the AI skeptics for years now with "you're lying and I won't even try it, and you're also a bot or a paid shill". Look at basically every AI-related post and there's almost always someone ready to call BS within the first few minutes of it being posted.
[dead]
> anecdotally based on their own subjective experience
So the “subjective” part counts against them. It’s better to make things objective. At least they should be reproducible examples.
When it comes to the “anecdotally” part, that doesn’t matter. Anecdotes are sufficient for demonstrating capabilities. If you can get a race car around a track in three minutes and it takes me four minutes, that’s a three minute race car.
The term "anecdotal evidence" is used as a criticism of evidence that is not gathered in a scientific manner. The criticism does not imply that a single sample (a car making a lap in 3 minutes) cannot be used as valid evidence of a claim (the car is capable of making a lap in 3 minutes).
Studies have shown that software engineers are very bad at judging their own productivity. When a software engineer feels more productive, the inverse is just as likely to be true. That's why anecdotal data can't be trusted.
I have never once seen extraordinary claims of AI wins accompanied by code and prompts.
Anecdotal: (of an account) not necessarily true or reliable, because based on personal accounts rather than facts or research.
If you say you drove a 3 minute lap but you didn't time it, that's an anecdote (and is what I mean). If you measured it, that would be a fact.
I think from your top post you also miss “representative”.
If you measure something and the sample is N=1, it might be a fact, but still a fact true for a single person.
I often don't need a sample size of 1000 to consider something worthy of my time, but if it is an N=1 sample from a random person on the internet, I am going to doubt it.
If I see 1000 people claiming it makes them more productive, I am going to check it out. If it is said by 5 people I follow and expect to know tech quite well, I am going to check as well.
3 replies →
In this case it's more like someone simulated a 3-minute lap and tried to pass it off as a real car with real friction.
They are not the same thing. If something works for me, I can rule out "it doesn't work at all". However, if something doesn't work for me I can't really draw any conclusions about it in general.
> if something doesn't work for me I can't really draw any conclusions about it in general.
You can. The conclusion would be that it doesn’t always work.
The author is not claiming that ai agents don't make him more productive.
"I use LLM-generated code extensively in my role as CEO of Carrington Labs, a provider of predictive-analytics risk models for lenders."
The people having a good experience with it want the people who aren't to share how they are using it, so they can tell them how they are doing it wrong.
Honestly though, I don't care much about coding with it; I rarely get to leave Excel for my work anyway. The fact that I can OCR anything in about a minute is a game changer, though.
Productivity gains in programming have always been incredibly hard to prove, esp. on an individual level. We've had these discussions a million times long before AI. Every time a manager tries to reward some kind of metric for "good" code, it turns out that it doesn't work that way. Every time Rust is mentioned, every C fan finds a million reasons why the improvement doesn't actually have anything to do with using Rust.
AI/LLM discussions are the exact same. How would a person ever measure their own performance? The moment you implement the same feature twice, you're already reusing learnings from the first run.
So, the only thing left is anecdotal evidence. It makes sense that on both sides people might be a little peeved or incredulous about the others claims. It doesn't help that both sides (though mostly AI fans) have very rabid supporters that will just make up shit (like AGI, or the water usage).
Imho, the biggest part missing from these anecdotes is exactly what you're using, what you're doing, and what baseline you're comparing it to. For example, using Claude Code in a typical, modern, decently well-architected Spring app to add a bunch of straightforward CRUD operations for a new entity works absolutely flawlessly, compared to a junior or even medior (medium?) dev.
Copy pasting code into an online chat for a novel problem, in an untyped, rare language, with only basic instructions and no way for the chat to run it, will basically never work.
This is why I can't wait for the costs of LLMs to shoot up. Nothing tells you more about how people really feel about AI assistants than how much they are willing to pay for them. These AIs are useful, but I would not pay much more than what they are priced at today.
Claims based on personal experience working on real world problems are likelier to be true.
It’s reasonable to accept that AI tools work well for some people and not for others.
There are many ways to integrate these tools and their capabilities vary wildly depending on the kind of task and project.
I will prefix this all by saying I'm not in a professional programming position, but I would consider myself an advanced amateur, and I do code for work some. (General IT stuff)
I think the core problem is a lot of people view AI incorrectly and thus can't use it efficiently. Everyone wants AI to be a Jr or Sr programmer, but I have serious doubts as to the ability of AI to ever have original thought, which is a core requirement of being a programmer. I don't think AI will ever be a programmer, but rather a tool to help programmers take the tedium away. I have seen massive speedups in my own workflow removing the tedium.
I have found prompting AI to be of minimal use, but tab-completion definitely speeds stuff up for me. If I'm about to create some for loop, AI will usually have a pretty good scaffold for me to use. If I need to handle an error, I start typing and AI will autocomplete the error handling. When I write my function documentation I am usually able to just tab-complete it all.
Yes, I usually have to go back and fix some things, and I will often skip various completion hints, but the scaffold is there, and as I start fixing faulty code it generated AI will usually pick up on the fixes and help me tab-complete the fixes themselves. If AI isn't giving me any useful tab-completions, I'll just start coding what I need, and AI picks up after a few lines and I can tab-complete again.
Occasionally I will give a small prompt such as "Please write me a loop that does X", or "Please write a setter function that validates the input", but I'll still treat that as a scaffold and go back and fix things, but I always give it pretty simple tasks and treat it simply as a scaffold generator.
I still run into the same problem solving issues I had before AI, (how do I tackle X problem?) and there isn't nearly as much speedup there, (Although now instead of talking to a rubber duck, I can chat with AI to help figure things out) but once I settle on the solution and start implementing it, I get that AI tab completion boost again.
With all that being said, I do also see massive boosts with fairly basic tasks that can be templated off something that already exists, such as creating unit tests or scaffolding a class, although I do need to go back and tweak things.
In summary, yes, I probably do see a 10x speedup, but it's really a 10x speedup in my typing speed more than a 10x speedup in solving the core issues that make programming challenging and fun.
> I have serious doubts as to the ability of AI to ever have original thought, which is a core requirement of being a programmer
If you find a job as an enterprise software developer, you'd see that your core requirement doesn't hold :)
On one hand "this is my experience, if you're trying to tell me otherwise I need extraordinary proof" is rampant on all sides.
On the other hand one group is saying they've personally experienced a thing working, the other group says that thing is impossible... well it seems to the people who have experienced a thing that the problem is with the skeptic and not the thing.
One group is keen on rushing to destroy society for a quality-of-life improvement that they can't even be bothered to measure.
Someone who swears they have seen ghosts is obviously gonna have a problem with people saying ghosts don't exist. Doesn't mean ghosts exist.
Ok, but if you're saying I've had delusions of LLMs being helpful, then either I need serious psychiatric care or we need to revisit the premise, because we're talking about a tool being useful, not the existence of supernatural beings.
4 replies →
But there is still a hugely important asymmetry: If the tool turns your office into gods of software, they should be able to prove it with godly results by now.
If I tell you AmbrosiaLLM doesn't turn me into a programming god... well, current results are already consistent with that, so it's not clear what else I could easily provide.
This is a bit of goalpost moving, though, because the primary experience is skeptics saying AI couldn't be trusted to design a ham sandwich vs. enthusiasts who've made five-course meals with AI (or, you know, the programming equivalent).
Absolutely there's a lot of unfounded speculation going around and a lot of aggressive skepticism of it, and both sides there are generally a little too excited about their position.
But that is fundamentally not what I'm talking about.
Now that the "our new/next model is so good that it's sentient and dangerous" AGI hype has died down, the new hype goalpost is "our new/next model is so good it will replace your employees and do their jobs for you".
Within that motte and bailey is, "well my AI workflow makes me a 100x developer, but my workflow goes to a different school in a different town and you don't know her".
There's value there, I use local and hosted LLMs myself, but I think there's an element of mania at play when it comes to self-evaluation of productivity and efficacy.
What I enjoy the most is that every "AI will replace engineers" article is written by an employee at an AI company, with testimonials from other people also working at AI companies
It's really a high-level bikeshed. Obviously we are all still using and experimenting with LLMs. However, there is a huge gap in experiences and total usefulness depending on the exact task.
The majority of HNers still reach for LLMs pretty regularly, even if they fail horribly and frequently. That's really the pit the tech is stuck in. Sometimes it one-shots your answer perfectly, or pair programs with you perfectly for one task, or notices a bug you didn't. Sometimes it wastes hours of your time for various subtle reasons. Sometimes it adamantly insists 2 + 2 = 55.
Latest reasoning models don't claim 2 + 2 = 55, and it's hard to find them making any sort of obviously false claims, or not admitting to being mistaken if you point out that they are
I can't get through a full conversation without obviously false claims. They will insist you are correct and that your correction is completely correct, despite that also being wrong.
It was clearly a simplified example; like I said, an endless bikeshed.
Here is a real one. I was using the much-lauded new Gemini 3? last week and wanted it to do something a slightly specific way, for reasons. I told it specifically and added it to the instructions: DO NOT USE FUNCTION ABC.
It immediately used FUNCTION ABC. I asked it to read its instructions back to me. It confirmed what I had put there. So I asked it again to change it to another function. It told me that FUNCTION ABC was not in the code, even though it was clearly right there in the code.
I did a bit more prodding and it adamantly insisted that the code it had generated did not exist, again and again and again. Yes, I tried reversing it to USE FUNCTION XYZ. It still wanted to use ABC.
TBH a lot of this is subjective. Including productivity.
My other gripe is that productivity is only one aspect of software engineering. You also need to look at tech debt introduced and other aspects of quality.
Productivity also takes many forms so it's not super easy to quantify.
Finally... software engineers are far from being created equal. There's a VERY big difference between what someone doing CRUD apps for a small web dev shop does vs., e.g., an infra engineer in big tech.
This is not always the case, but I get the impression that many of them are paid shills, astroturf accounts, bots, etc. Including on HN. Big AI is running on an absurd amount of capital and they're definitely using that capital to keep the hype cycle going as long as possible while they figure out how to turn a profit (or find an exit, if you're cynical - which I am).
That’s a bit of a reductive view.
For example, even the people with the most negative view on AI don’t let candidates use AI during interviews.
You can disagree on the effectiveness of the tools but this fact alone suggests that they are quite useful, no?
There is a difference between being useful for sandboxed toy problems and being useful in production.
Not really. I'd rather find out very quickly that someone doesn't know a domain space rather than having to wade through plausible looking but bad answers to figure out the exact same thing.
At this point it's foolish to assume otherwise. This also applies to places like Reddit and X; there are intelligence services and companies with armies of bot accounts. Modern LLMs make it so easy to create content that looks real enough. Manufacturing consent is very easy now.
I think it's a complex discussion because there's a whole bundle of new capabilities, the largest one arguably being that you can build a conversational interface to any piece of software. There's tons of pressure to express this in terms of productivity, financial and business benefits, but like with a coding agent, the main win for me is reduction of cognitive load, not an obvious "now the work gets done 50% faster so corporate can cut half the dev team."
I can talk through a possible code change with it, which is just a natural, easy, and human way to work; our brains evolved to talk and figure things out in conversation. The jury is out on how much this actually speeds things up or translates into cost savings. But it reduces cognitive load.
We're still stuck in a mindset where we pretend knowledge workers are factory workers who can sit there for 8 hours producing consistently with their brain turned off. "A couple hours a day of serious focus at best" is closer to the reality, so maybe an LLM can turn the other half of the day into something more useful?
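As a minimal sketch of the "conversational interface to any piece of software" idea above (assuming a hypothetical llm_complete() wrapper around whatever model API you use; nothing here is a real library call):

    import subprocess

    def llm_complete(prompt: str) -> str:
        """Hypothetical wrapper around your model API of choice."""
        raise NotImplementedError  # swap in a real client here

    def ask_git(request: str) -> str:
        """Map a plain-English request onto an existing CLI tool (here: git)."""
        cmd = llm_complete(
            "Translate this request into one read-only git command. "
            "Output only the command.\n\nRequest: " + request
        )
        # A real version would validate cmd before running it.
        return subprocess.run(cmd.split(), capture_output=True, text=True).stdout

    # e.g. print(ask_git("who touched src/parser.py most recently?"))

The point isn't that this is hard to write; it's that the wrapper turns any existing tool into something you can talk to, which is where the cognitive-load reduction comes from.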
There is also the problem that any LLM provider can and absolutely will enshittify the LLM overnight if they think it's in their best interest (feels like OpenAI has already done this).
My extremely casual read of whatever research I've seen discussed suggests that maybe with high-quality AI tools you can get work done 10-20% faster? But you don't have to think quite as hard, which is where I feel the real benefit is.
As a CS student who kinda knows how to build things, I do in fact get a speedup when querying AI or letting AI do some coding for me. However, I have a poor understanding of the system it builds, and it does a quite frankly terrible job with project architecture. I use Claude Sonnet 4.5 with Claude Code, and I can get things implemented rather quickly while using it, but if anything goes wrong I just don't have that great an idea of where anything is, what code is in charge of what, etc.
I can also deeply feel the brainrot of using AI. I get lazy, and I can feel myself getting worse at solving what should be easy problems. My mental image of the problem to solve gets fuzzy, and I don't train that muscle like I would if I didn't use AI to help me solve it.
There are different types of contrary claims though, which may be an issue here.
One example: "agents are not doing well with code in languages/frameworks which have many recent large and incompatible changes like SwiftUI" - me: that's a valid issue that can be slightly controlled for with project setup, but still largely unsolved, we could discuss the details.
Another example: "coding agents can't think and just hallucinate code" - me: lol, my shipped production code doesn't care, bring some real examples of how you use agents if they don't work for you.
There's a lot of the second type on HN.
Yeah, but there's also a lot of "lol, my shipped production code doesn't care" type comments with zero info about the type of code you're talking about, the scale, or the longer-term effects on quality, maintainability, and expertise that using agentic tools can have.
That's also far from helpful or particularly meaningful.
There's a lot of "here's how agents work for me" content out there already. From popular examples from simonw and longer videos from Theo, to thousands of posts and comments from random engineers. There's really not much that's worth adding anymore. (Unless you discover something actually new) It works for use cases which many have already described.
… and trollish to boot. Y U gotta “lol”?
But since there's grey in my beard, I've seen it several times: in every technological move forward there are obnoxious hype merchants, reactionary status quo defenders, and then the rest of us doing our best to muddle through.
Last time I ran into this, it was a difference in how the person used the AI: they weren't even using the agents, they were complaining that the AI didn't do everything in one shot in the browser. You have to figure out how people are using the models, because everyone was using AI in the browser in the beginning, and a lot of people are still using it that way. Those of us praising the agents are using things like Claude Code. There is a night-and-day difference in how you use it.
>> when others make claims to the contrary suddenly there is some overwhelming burden of proof that has to be reached
That is just plain narcissism. People seeking attention in the slipstream of megatrends make claims that have very little substance. When confronted with a rational argument, they can't respond intellectually, so they try to dominate the discussion by demanding an overwhelming burden of proof while their own position remains underwhelming.
LinkedIn and Medium are densely concentrated with this sort of content. It’s all for the likes.
Public discourse on this is a dumpster fire. But you're not making a meaningful contribution.
It is the equivalent of saying: stenotype enthusiasts claim they're productive, but when we give stenotypes to a large group of typists we get data disproving that.
Which should immediately highlight the issue.
As long as these discussions aren't prefaced with the metric and methodology, any discussion on this is just meaningless online flame wars / vibe checks.
> One thing I find really funny is when AI enthusiasts make claims about agents and their own productivity its always entirely anecdotally based on their own subjective experience, but when others make claims to the contrary suddenly there is some overwhelming burden of proof that has to be reached in order to make any sort of claims regarding the capabilities of AI workflows. So which is it?
Really? It's little more than "I am right and you are wrong."
Subjective experience is heavily influenced by expectations and desires, so they should try to verify.
Everything you need to know about AI productivity is shown in this first chart here:
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
Not confident it's quite that straightforward. Here's a presentation from Meta showing a 6-12% increase in diff throughput for above-median users of agentic coding: https://www.youtube.com/watch?v=1OzxYK2-qsI
It's because the thing is overhyped and too many people are vested in keeping the hype going. Facing reality at this point, while necessary, is tough. The amount of ads for scam degrees from reputable unis about 'Chief AI Officer' bullshit positions is staggering. There's just too much AI bubbling.
If someone seems to have productivity gains when using an AI, it is hard to come up with an alternate explanation for why they did.
If someone sees no productivity gains when using an AI (or a productivity decrease), it is easy to come up with ways it might have happened that weren't related to the AI.
This is an inherent imbalance in the claims, even if both people have brought 100% proof of their specific claims.
A single instance of something doing X is proof of the claim that something can do X, but no amount of instances of something not doing X is proof of the claim that something cannot do X. (Note, this is different from people claiming that something always does X, as one counter example is enough to disprove that.)
Same issue in math with the difference between proving a conjecture is sometimes true and proving it is never true. Only one of these can be proven by examples (and only a single example is needed). The other can't be proven even by millions of examples.
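In quantifier terms, it's just the gap between an existential and a universal claim; a minimal sketch, with a made-up predicate Succeeds(t) for "attempt t produced X":

    % "It can do X": existential, settled by a single witness.
    \exists t.\ \mathrm{Succeeds}(t)
    % "It can never do X": universal, never settled by finitely many failures.
    \forall t.\ \neg\mathrm{Succeeds}(t)

One success proves the first line; a million failures still leave the second line open.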
I don't get it? Yes, you should require a valid reason before believing something.
The only objective measures I've seen people attempt to take have at best shown no productivity loss:
https://substack.com/home/post/p-172538377
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
This matches my own experience using agents, although I'm actually secretly optimistic about learning to use it well
The burden you are placing is too high here. Do you demand controlled trials for everything you do or else you refuse to use it or accept that other people might see productivity gains? Do you demand studies showing that static typing is productive? Syntax highlighting? IDEs or Vim? Unit testing? Whatever language you use?
Obviously not? It would be absurd to walk into a thread about Rust and say “Rust doesn’t increase your productivity and unless you can produce a study proving it does then your own personal anecdotes are worthless.”
Why the increased demand for rigor when it comes to AI specifically?
Why do you believe that the sky is blue? What randomized trial with proper statistical controls has shown this to be true?
Pretending the only way anybody comes to a conclusion about anything is by reading peer-reviewed journals is an absurdly myopic view of epistemological practices in the real world.
Which it is is clear: the enthusiasts have spent countless hours learning, configuring, adjusting, figuring out limitations, guarding against issues, etc. etc. etc., and now do 50 to 100 PRs per week like Boris.
Others… need to roll up their sleeves and catch up.
There isn't anything clear until someone manages to publish measurable and reproducible results for these tools while working on real world use cases.
Until then it's just people pulling the lever on a black box.
Hundreds of millions of people use these every day on real world use cases. If they didn’t work, people wouldn’t use them.
This is the measurable evidence you are talking about: https://a16z.com/revenue-benchmarks-ai-apps/
Merely counting PRs is not very impressive to me. My pre-LLM average is around 50/week anyway. But I'm not going to claim that somehow makes me the best programmer ever. I'm sure someone with 1 super valuable PR can easily create more value than I do.
Maybe I'm just in a weird place, but I can't imagine 50 PRs a week.
Maybe it's because I spend a lot of my time just turning problem reports on Slack into tickets with tables of results and stack traces.
A bunch of tiny PRs is not hard to do manually. But LLMs can write boatloads of code to do kind of sophisticated things. You do have to figure out how to get to a point where you can trust the code. But the LLMs can help you write boatloads of tests too based on plain English descriptions.
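As a minimal sketch of that last point (pytest, with a hypothetical slugify function and module; the plain-English description is in the comment):

    # Hypothetical example: tests an LLM might produce from the description
    # "slugify lowercases, trims whitespace, and joins words with hyphens".
    from myproject.text import slugify  # hypothetical module under test

    def test_slugify_lowercases_and_trims():
        assert slugify("  Hello World  ") == "hello-world"

    def test_slugify_joins_multiple_spaces():
        assert slugify("a   b") == "a-b"

The value isn't the individual asserts; it's that generating dozens of these from a description is cheap, which is one route to trusting the generated code.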
Or the tool makers could just make better tools. I'm in that camp, I say make the tool adapt to me. Computers are here to help humans, not the reverse.
So when you get a new computer you just use it as-is? Straight out of the box, that's your computer experience? You don't install any programs, connect a printer, nothing, eh? It's too funny reading "the tool should adapt to me" when there are roughly 8.3 billion "me"s around; I honestly can't even work out what that means.
People working in languages/libraries/codebases where LLMs aren't good is a thing. That doesn't mean they aren't good tools, or that those things won't be conquered by AI in short order.
I try to assume people who are trashing AI are just working in systems like that, rather than being bad at using AI, or worse, shit-talking the tech without really trying to get value out of it because they're ethically opposed to it.
A lot of strongly anti-AI people are really angry human beings (I suppose that holds for vehemently anti-<anything> people), which doesn't really help the case, it just comes off as old man shaking fist at clouds, except too young. The whole "microslop" thing came off as classless and bitter.
The "microslop" thing is largely just a backlash at MS jamming AI into every possible crevice of every program and service they offer, with no real plan or goal other than "do more AI".