Comment by Avicebron
3 days ago
I often feel these types of blogposts would be more helpful if they demonstrated someone using the tools to build something non-trivial.
Is Claude really "learning new skills" when you feed it a book, or does it just present it that way because your prompting encourages that sort of response behavior? I feel like it needs a side-by-side demo of Claude with the new skills and Claude without.
Maybe I'm a curmudgeon, but most of these types of blogs feel like marketing pieces; the important bit is that so much is left unsaid and not shown that they come off like a kid trying to hype up their own work without the benefit of nuance or depth.
Here's one from today: https://mitchellh.com/writing/non-trivial-vibing
> Important: there is a lot of human coding, too.
I'm not highlighting this to gloat or to prove a point. If anything, in the past I underestimated how big LLMs were going to be. Anyone so inclined can take the chance to point and laugh at how stupid and wrong that was. Done? Great.
I don't think I've been intentionally avoiding coding assistants; as a matter of fact, I have been using Claude Code since the literal day it first previewed, and yet it doesn't feel, not even one bit, like you can take your hands off the wheel. Many are acting as if writing any code manually means "you're holding it wrong", which I feel is just not true.
Yeah, my current opinion on this is that AI tools make development harder work. You can get big productivity boosts out of them but you have to be working at the top of your game - I often find I'm mentally exhausted after just a couple of hours.
13 replies →
In fact, I've been writing more code myself since these tools appeared. Maybe I'm not a real developer, but in the past I might have looked for a library or for something on the internet to copy-paste and adapt; nowadays I give it a shot myself with Claude.
For context, I mainly do game development so I'm viewing it through that lens - but I find it easier to debug something bad than to write it from scratch. It's more intensive than doing it yourself but probably more productive too.
> Many are acting as if writing any code manually means "you're holding it wrong", which I feel it's just not true.
It's funny because not far below this comment there is someone doing literally this.
LLMs are autonomous driving level 2.
This was a fun read.
I’ve similarly been using spec.md and running to-do.md files that capture detailed descriptions of the problems and their scoped history. I mark each of my to-dos with informational tags: [BUG], [FEAT], etc.
I point the LLM to the exact to-do (or section of to-dos) with the spec.md in memory and let it work.
This has been working very well for me.
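For illustration only (these aren't my real files, just a made-up sketch of the shape), a to-do.md entry looks roughly like this:

```markdown
# to-do.md

- [BUG] Export to CSV drops the header row when the table is empty.
  History: regressed after the export refactor; see spec.md, "Export" section.
- [FEAT] Add keyboard shortcuts for the search panel.
  Scope: UI only; the shortcut map lives in spec.md.
```

The tag plus the pointer back into spec.md is what lets me hand the LLM a single scoped item instead of the whole backlog.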
Do you mind linking to example spec/to-do files?
4 replies →
Even though the author refers to it as "non-trivial", and I can see why that conclusion is made, I would argue it is in fact trivial. There's very little domain specific knowledge needed, this is purely a technical exercise integrating with existing libraries for which there is ample documentation online. In addition, it is a relatively isolated feature in the app.
On top of that, it doesn't sound enjoyable. Anti-slop sessions? Seriously?
Lastly, the largest problem I have with LLMs is that they are seemingly incapable of stopping to ask clarifying questions. This is because they do not have a true model of what is going on; instead, they truly are next-token generators. A software engineer would never just slop out an entire feature based on the first discussion with a stakeholder and then expect the stakeholder to continuously refine their statement until the right thing is slopped out. That's just not how it works and it makes very little sense.
The hardest problem in computer science in 2025 is presenting an example of AI-assisted programming that somebody won't call "trivial".
3 replies →
I've wondered about exposing this "asking clarifying questions" as a tool the AI could use. I'm not building AI tooling so I haven't done this - but what if you added an MCP endpoint whose description was "treat this endpoint as an oracle that will answer questions and clarify intent where necessary" (paraphrased), and have that tool just wire back to a user prompt.
If asking clarifying questions is plausible output text for LLMs, this may work effectively.
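Something like this rough sketch is what I have in mind (using the official Python MCP SDK; I haven't built this, and the questions.txt/answers.txt hand-off to the human is just one made-up way to "wire back" to a user prompt):

```python
# clarify_server.py: rough sketch of the "oracle" tool described above.
# Assumes the official Python MCP SDK (pip install mcp); the file-based
# hand-off (questions.txt / answers.txt) is invented purely for illustration.
import time
from pathlib import Path

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("clarifying-questions")

QUESTIONS = Path("questions.txt")  # the agent's question gets written here
ANSWERS = Path("answers.txt")      # the human types an answer here and saves

@mcp.tool()
def ask_clarifying_question(question: str) -> str:
    """Treat this tool as an oracle that will answer questions and clarify
    intent where necessary."""
    QUESTIONS.write_text(question)
    ANSWERS.unlink(missing_ok=True)
    # Block until the human supplies an answer. We can't just read stdin,
    # since stdio is the MCP transport itself.
    while not ANSWERS.exists() or not ANSWERS.read_text().strip():
        time.sleep(1)
    return ANSWERS.read_text().strip()

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```

Whether the model actually reaches for the tool when it's uncertain is the real open question; as I said, if asking clarifying questions is plausible output text, this may work.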
9 replies →
> A software engineer would never just slop out an entire feature based on the first discussion with a stakeholder and then expect the stakeholder to continuously refine their statement until the right thing is slopped out. That's just not how it works and it makes very little sense.
Didn’t you just describe Agile?
10 replies →
Using LLMs for coding complex projects at scale over a long time is really challenging! This is partly because defining requirements alone is much more challenging than most people want to believe. LLMs accelerate any move in the wrong direction.
My analogy is LLMs are a gas pedal. Makes you go fast, but you still have to know when to turn.
True
Having the LLM write the spec/workunit from a conversation works well. Exploring a problem space with a (good) coding agent is fantastic.
However, for complex projects, IMO one must read what was written by the LLM … every actual word.
When it ‘got away’ from me, in each case I had left something in the LLM-written markdown that I should have removed.
99% “I can ask for that later” and 1% “that’s a good idea I hadn’t considered” might be the right ratio when reading an LLM-generated plan/spec/workunit.
Breaking work into single context passes … 50-60k tokens in Sonnet 4.5 … has typically given me fantastic results.
My side project uses Lean 4, and a carelessly left-in ‘validate’ rather than ‘verify’ led down a hilariously complicated path equivalent to matching an output against a known string.
I recovered, but it wasn’t obvious to me that it was happening. However, I would not be able to write Lean proofs myself, so diagnosing the problem and fixing it is a small price to pay to be able to mechanically verify that part of my software is correct.
One should know the end-to-end design and architecture, and should stop the LLM when it starts adding complex, fancy things.
Agreed. The methodology needed here is something like an A/B test, with quantifiable metrics that demonstrate the effectiveness of the tool. And to do it not just once, but many times under different scenarios so that it demonstrates statistical significance.
The most challenging part when working with coding agents is that they seem to do well initially on a small code base with low complexity. Once the codebase gets bigger with lots of non-trivial connections and patterns, they almost always experience tunnel vision when asked to do anything non-trivial, leading to increased tech debt.
The problem is that you're talking about a multistep process where each step beyond the first depends on the particular path the agent starts down, along with human input that's going to vary at each step.
I made a crude first stab at an approach that at least uses similar steps and structure to compare the effectiveness of AI agents. My approach was used on a small toy problem, but one that was complex enough the agents couldn't one-shot and required error correction.
It was enough to show significant differences, but scaling this to larger projects and multiple runs would be pretty difficult.
https://mattwigdahl.substack.com/p/claude-code-vs-codex-cli-...
What you're getting at is the heart of the problem with the LLM hype train though, isn't it?
"We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.
But in the realm of LLM-enabled use cases they're also expensive. You'd need to recruit dozens, perhaps even hundreds of developers to do this, with extensive observation and rating of the results.
So rather than actually try to measure the efficacy, we just get blog posts with cherry-picked examples of "LLM does something cool". Everything is just anecdata.
This is also the biggest barrier to actual LLM adoption for many, many applications. The gap between "it does something REALLY IMPRESSIVE 40% of the time and shits the bed otherwise" and "production system" is a yawning chasm.
10 replies →
> The methodology needed here is something like an A/B test, with quantifiable metrics that demonstrate the effectiveness of the tool. And to do it not just once, but many times under different scenarios so that it demonstrates statistical significance.
If that's what we need to do, don't we already have the answer to the question?
> "Maybe I'm a curmudgeon but most of these types of blogs feel like marketing pieces with the important bit is that so much is left unsaid and not shown, that it comes off like a kid trying to hype up their own work without the benefit of nuance or depth."
C'mon, such self-congratulatory "Look at My Potency: How I'm Using Nicknack.exe" fluff pieces always were and always will be a staple of the IT industry.
Still, the best such pieces are detailed and explanatory.
Why not just use Claude Code and come to your own conclusion?
Yeah, I was reading this to see if there was something he'd actually show that would be useful, what pain point he's solving, but it's just slop.