Comment by azuanrb

18 hours ago

I recently reviewed an app built mostly with vibe coding. The owner said it was almost ready to launch and just needed a quick check.

After looking through it, the database design was a mess. Some features worked, some didn’t. I explained the missing pieces and why things were breaking. Like OP said, he’s the domain expert.

I used billions of tokens last month alone. The tools are getting better fast. But giving AI to a domain expert doesn’t mean you no longer need software engineers.

A domain expert can use AI to build software. And a software engineer can use AI to learn about the domain. Both bring different expertise to the table.

Where I am headed, I think, is to basically be a platform engineer. The job is to create the guardrails, validation, prompt library, and both agent and manual reviews that keep the domain experts safe when they start using coding agents.

It's a little bit like being T2/T3 customer support [or support engineer], but internal. You're there to catch the dangerous spots, the weird edge cases, and to make sure that everything is set up correctly, rather than to solve 100% of the routine problems yourself.

There's also plenty of room for cross-cutting concerns, of course.

  • I suspect infrastructure will eventually become simpler to orchestrate without faults too, thanks to well-developed devops harnesses. Even then, the risk and scale a company is willing to accept will still fall on humans for some time. I don't see most people vibe coding a million-user app that has deeper needs than the basics we see now.

> I used billions of tokens last month alone.

I use Claude Code (Opus 4.6 at max effort) all day long, and I genuinely don't understand how this is possible. Is that usage paying off?

This is very likely due to my lack of understanding, but... how?

  • Long codex sessions lead to a lot of cached token hits, especially when you resume them after a few hours.

    • I personally don't count cached hits as $used... Neither in my harnesses, nor in the LLM-enabled apps I create. A cached token cannot be counted 1:1 against a non-cached token; that would be silly.

      Wait... when some Claude 5x/20x users say they are getting "$2000 of tokens for $100," does the 2k value include cached tokens, counted at the same $/token either way?

      We cannot be this dumb as a community, can we? I must be wrong or misunderstanding.
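
      To make the difference concrete, here is a minimal sketch comparing the two ways of valuing the same usage. The per-million-token prices are made-up placeholders, not any vendor's real rates, and the cached/uncached split is an assumption; only the 2 billion input and 1 million output totals echo figures mentioned elsewhere in this thread.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    uncached_input: int  # input tokens billed at the full input rate
    cached_input: int    # input tokens served from the prompt cache
    output: int          # generated tokens


# Hypothetical prices in dollars per million tokens (assumptions for illustration only).
PRICE_INPUT = 10.0
PRICE_CACHED = 1.0
PRICE_OUTPUT = 50.0


def naive_cost(u: Usage) -> float:
    """Value every input token at the full rate, cached or not."""
    total_input = u.uncached_input + u.cached_input
    return (total_input * PRICE_INPUT + u.output * PRICE_OUTPUT) / 1_000_000


def weighted_cost(u: Usage) -> float:
    """Value cached input tokens at the (cheaper) cached rate."""
    return (
        u.uncached_input * PRICE_INPUT
        + u.cached_input * PRICE_CACHED
        + u.output * PRICE_OUTPUT
    ) / 1_000_000


# Assumed split: 2 billion input tokens, 95% of them cache hits, 1 million output.
usage = Usage(uncached_input=100_000_000, cached_input=1_900_000_000, output=1_000_000)

if __name__ == "__main__":
    print(naive_cost(usage))     # 20050.0
    print(weighted_cost(usage))  # 2950.0
```

      Under these assumed numbers, the naive figure is almost seven times the weighted one, which is the kind of gap the question above is asking about.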

  • Vibe coded a simple game (10,000 tokens of source code) with two popular coding agents. (Once each, to compare.)

    One spent 200,000 tokens, to produce 10,000.

    The other spent 1.9 million.

    It could have been a single LLM call (10k tokens). lmao

    (I note that the latter was designed by a company whose main source of revenue is token spend...)

  • Don’t forget context. Basically I have 2 billion input and 1 million output. Every prompt you send resends the whole context again and again. Say you have 500k of context used: 10 messages is 5 million tokens, 100 messages is 50 million, and across 5 threads that's 250 million.

    • But how is it even possible (bad harness?), or wise, to send 500k or 1M tokens per call? Regarding cache, how are you not hitting the 1hr cache? Also, start new chats early and often!

      I have been "agentic coding" since Sonnet 3.5 and after this paper came out, it became my bible:

      https://github.com/adobe-research/NoLiMa

      Last I checked, all models suck as you fill the context window. "Context engineering" is how you do this whole thing.
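
The back-of-envelope arithmetic in the sub-thread above (500k of context, resent with every message) can be written out as a small Python sketch. The function names are hypothetical, and the second function rests on an extra assumption that is not in the thread: a context that starts empty and grows by a fixed amount each turn.

```python
def resent_input_tokens(context_tokens: int, messages: int, threads: int = 1) -> int:
    """Input tokens sent when every message resends a fixed-size context."""
    return context_tokens * messages * threads


def growing_input_tokens(tokens_per_turn: int, messages: int) -> int:
    """Input tokens when the context starts empty and grows by a fixed amount per turn.

    Message i resends everything accumulated so far, i.e. i * tokens_per_turn tokens.
    """
    return sum(i * tokens_per_turn for i in range(1, messages + 1))


if __name__ == "__main__":
    print(resent_input_tokens(500_000, 10))              # 5000000
    print(resent_input_tokens(500_000, 100))             # 50000000
    print(resent_input_tokens(500_000, 100, threads=5))  # 250000000
    print(growing_input_tokens(5_000, 100))              # 25250000
```

Even in the growing case, total input scales roughly with the square of the number of messages, which is one argument for starting new chats early and often.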

Honestly, this is my experience as well. LLMs make it easier to explore other domains, but they do not make you the master of one; you still need expert domain knowledge.

That said, they do make excellent tools to quickly try out new ideas and dive into them; they can even be great learning accelerators if you have a curious mind.

Domain expertise combined with a QA mindset could replace SWE, but consistent QA mindset is rare

  • I agree that a consistent QA mindset is rare, but even when it's present, I'm not sure it's enough to replace an SDE.

    I very recently looked at the codebase of a vibe-coded app made by someone with domain expertise but no software dev experience.

    It was very clear to me that he had described it from his POV to an AI, and the AI had implemented features in a manner that technically worked, but made future maintenance or expansion extremely tricky, which is why he was now looking for a dev.

    For example, in his data schema, for every item on a menu, instead of simply having an array property like so for ingredients:

        items["latte"]["ingredients"] = ["water", "milk", "sugar", ...]
    

    He had individual flags for every item for every possible ingredient it could have or not have:

        items["latte"]["has_milk"] = true
        items["latte"]["has_nutmeg"] = false
        items["latte"]["has_cinnamon"] = false
        items["latte"]["has_sugar"] = true
        ...
    

    This technically worked and passed tests from his POV at an MVP level. But it added a lot of complications when actually trying to build more features, or when a new menu item had ingredients the founder hadn't thought to include in the schema beforehand.

    I totally get how he ended up where he did, though. While describing it to the AI, he probably said something like "store info on each menu item's ingredients, they might have milk or coffee or sugar". The AI created individual flags for them, and he didn't think to question it because he didn't know what's "right" or "wrong". Then, as he kept building, the AI stuck with individual flags instead of swapping them out for an array mechanism, and he couldn't have known the correct way to implement it.

    Only a dev with experience would know how to describe the system to an AI model to get an output that works well, and how to assess the quality of its output beyond what can be assessed through the basic UI. This wasn't a QA failure, it was a design failure.
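
    As a rough sketch of why the array form is easier to extend, here are the two shapes described above in Python. The helper names and the "chai" item with "cardamom" are hypothetical and only there for illustration; the real app's storage layer is not shown in this thread.

```python
from typing import Dict, List

# Flag-per-ingredient shape: every possible ingredient needs its own field.
flag_items: Dict[str, Dict[str, bool]] = {
    "latte": {
        "has_milk": True,
        "has_nutmeg": False,
        "has_cinnamon": False,
        "has_sugar": True,
    },
}

# Array shape: an item simply lists what it contains.
list_items: Dict[str, Dict[str, List[str]]] = {
    "latte": {"ingredients": ["water", "milk", "sugar"]},
}


def flags_to_ingredients(item: Dict[str, bool]) -> List[str]:
    """Convert has_* flags into a sorted ingredient list (a one-off migration helper)."""
    prefix = "has_"
    return sorted(
        key[len(prefix):]
        for key, present in item.items()
        if key.startswith(prefix) and present
    )


def items_containing(items: Dict[str, Dict[str, List[str]]], ingredient: str) -> List[str]:
    """Names of all menu items that list the given ingredient."""
    return sorted(name for name, item in items.items() if ingredient in item["ingredients"])


# A new item with a never-before-seen ingredient needs no schema change in the array shape.
list_items["chai"] = {"ingredients": ["water", "milk", "cardamom"]}

if __name__ == "__main__":
    print(flags_to_ingredients(flag_items["latte"]))  # ['milk', 'sugar']
    print(items_containing(list_items, "milk"))       # ['chai', 'latte']
    print(items_containing(list_items, "cardamom"))   # ['chai']
```

    With the flag shape, adding "cardamom" would mean adding a has_cardamom field to every existing item (or treating a missing flag as false everywhere it is read), which is the maintenance cost described above.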

    • I have found this to be the case as well. As developers we are just really good stewards of the code because we obviously have knowledge to make sure that the code is engineered in a way that it can scale and grow without tech debt becoming unwieldy.

      I found AI to be pretty bad with a bare-bones code base that doesn't already have solid patterns in place. It works, but it's just monolithic files galore. useEffect hooks everywhere. Nasty state situations with poor data practices. Security vulnerabilities up the wazoo.

      It's weird to have this conversation with them. Like, yeah, your code works, but it's so tangled up that it's sometimes hard to reason about where to even begin unwinding it all.

      It can be done but cleaning up someone else's slop is the exact reason why I hate AI. It was hard enough to review great code and be critical, honest, and fair but we knew it was an essential part of the process, helped build shared understanding, and was a way to learn from one another.

      Whereas throwing in jumbled garbage to review just feels like a waste of our brain cells we spent decades earning by embracing the craft.

  • I disagree. At some point of complexity, building it yourself is faster, better and (as we're finding out) cheaper. And more fun, although that varies person to person.

    Wrestling with a code generator also creates a sunk cost fallacy where progress grinds to a halt but you still try and use the tools to fix the problems the tools created. Or you go in and fix things yourself, in a codebase you don't truly understand. A single developer can recreate the contextual nightmare miasma of a large corporation all by themselves.

    There's also an emerging market consideration: MVPs are easy to build, so time to market is no longer hard to achieve. It's not a differentiator.

    X was built in 3 days but is slow and riddled with bugs and security errors. There are also A, B, C, D and E which are effectively the same thing built just as fast.

    Z was built over six months and is rock solid and performant.

    Who wins the market share?

    • Who's got better marketing? Is it even a product where customers care about rock solid and performant? Which one's cheaper and has the least friction to getting started? Which one's CEO golfs with your company's CEO?

      Time and time again, the market proves worse is better, from the format wars of the 80's and 90's, to Microsoft Windows still being dominant (and oh yeah, Teams). Sometimes quality does win, but if being built in 3 days means they can make a profit charging 1/100th the price of Z, I wouldn't count the cheap ones out of the game just because Z is better.

  • Personally, my ability to understand atrophies (or at least is reduced) when I'm purely a reviewer, compared to writing the code ‘myself’.

    Probably similar to hand writing notes (while digesting + synthesizing and not just being a scribe) vs reading notes somebody else took.

    • I'm guessing there's some science or research behind this, but I agree. Similarly, I've had projects where I did everything fairly solo—programmed, designed ux/ui, maybe validated with users, etc. It was significantly harder, particularly in the phase where you're working between the first two and the idea isn't perfectly set. It worked much better to design, then build in explicit steps, but it was so easy to start coding, have the design looking and feeling okay, then start iterating on the design—but iterating in code rather than Figma or wherever. It's fine for a little while, but you realize you've spent a day (maybe more) doing it in this less efficient way.

      It's similar to the 80/20 rule. When you're coding and designing from the hip, you'll do pretty well for a while, but as you near completion, you can't quite tie up all the loose design ends. That's the part where it's probably better to just design fully to 100% first and then build, which is closer to what happens when the roles are separate. At least in my experience. I will say, though, that the part where you're designing in code (productively or wastefully) is pretty fun. At least until you hit the wall and get frustrated with how often you've deleted and rewritten the same thing.

  • > Domain expertise combined with a QA mindset could replace SWE, but consistent QA mindset is rare

    I've heard this story at least 3 times already:

    - Domain expertise combined with outsourcing could replace expensive US SWE

    - Domain expertise combined with SWE could replace QA

    - Domain expertise combined with SWE could replace infra engineers

    Why is everyone so preoccupied with replacing someone with someone instead of doing their fucking job?

  • You can't test quality into a product. Regardless of how much of a "QA mindset" you have, you can only ever find a fraction of defects and technical debt through external testing. This can be good enough for a throwaway app that will only be used by a limited customer base for a limited time. But that approach quickly bogs down if you try to scale it into a product that will be used indefinitely by a huge set of external customers. At some point velocity drops to near zero because the code base is such a mess that it becomes impossible to add new features without causing regression defects or breaking backward compatibility.

  • The engineering part of software engineering is the hard part for LLMs. How is that replaceable with these skills?

  • I don’t think so. Most things are complicated enough to require multiple domain experts working together to achieve a goal.

    The Dunning-Kruger effect is in full swing as people think AI replaces the need for a domain expert.

    Most of the value in the expert isn't the 80% but the tail 20% or 10% where AI fails. For a one-off personal app or website, 80% is plenty, but only for that.