Comment by f311a

1 day ago

How many more months do we need to wait, until big companies realize that flash models work just fine if you:

1) Don't ask LLMs for big changes

2) Review everything and point them in the right direction

Large models still suck at big changes, they produce questionable architecture and you still have to review the code, if your project is serious enough.

The codebase quickly become a mess, if you don't pay enough attention. Does not matter which model.

So why bother with big models, when flash models are 10x cheaper and much faster to iterate under guidance? Large models can be used for security and bug audits. Flash models work almost the same for changes under 300 LOC when you dictate how you want your code to look.

It's pretty simple; organizations are willing to tolerate paying $1500/month/engineer, which seems to be roughly inline with "normal" consumption for most full-time engineers. If that number grows significantly, then I bet companies will start exploring flash models more, as you propose.

  • They are willing to tolerate it now, which is quite a switch up from the free for all we had a few weeks ago, and if they aren’t able to tie in this new ~$1500p/m cap to demonstrable productivity and revenue increases then that will be kneecapped even faster

    • There are plenty of expenses in this order of magnitude that are not tied to direct increases in productivity. I think it may become a serious hiring impediment for companies to be really skimpy on these budgets for example.

      1 reply →

    • I mean we saw this with cloud spending and especially with logging and database read write cost across numerous companies.

      It’s a clear pattern in service delivery for software for a while now. Hell for many goods and services in general, like Uber rides themselves.

      Start cheap, get some vendor lock in, service provider reduces discounts, consumer notices and then reacts to the price by reducing consumption.

  • > organizations are willing to tolerate paying $1500/month/engineer

    One organization, that is a software company

    > which seems to be roughly inline with "normal" consumption for most full-time engineers

    My peers are using $20/mo plans, only a handful are using more than $100/mo in tokens. We haven’t had any limits imposed yet.

  • Which organizations?

    Uber is not representative of any trend beyond big tech and VC over funded startups.

The easy decision is to just go with the biggest SOTA model you can afford.

But this overlooks the other critical part of getting the most out of these things: the harness. I run an autonomous plan/design/code/build/test pipeline with agents using my own orchestrator. Different models are better at different stages, and I use LLMs to judge the output between them. Not everything needs Opus 4.8.

The harness provides both the scaffolding to get the right things into the model, and the right things out. But it also lets you dictate which model does which work.

It's the pipeline, not the model, that gets you quality at a given token budget.

There is something about using the most advanced tooling possible. Why would you pay for IntelliJ, if Eclipse can do the same thing a bit worse?

You want to master your craft, develop "optimal" systems, understand where things are going by utilizing SOTA.

You can call it FOMO, but you get the point.

opus to produce workflows, flash 3.5 to do them.

Chinese models prob work too, but idk since i cant use them at work

Is your argument that $1500 / mo is too much? Why would the engineering team not be more rigorous in their model selection given a constraint?

  • If you had a business task to complete that was only possible with ai and it cost you >$1500/month of work, how long would you have to delay the task so that it's cheaper long run to buy hardware and do local models?

    $1,500/mo * 14 months = $21,000.

    If local models are 14mo behind as many in HN say it may be profitable to just wait. Maybe just spend a few hundred dollars of your tokens and buy hardware piece by piece.

    • Nearly no one is doing anything that is “only possible with AI”. This doesn’t seem like a relevant calculation. People spend on AI as an investment in their current productivity.

    • It also presupposes that open models will bridge that gap towards opus4.5, which was really when I drank the AI coding koolaid

I wonder to what extent models should figure out which model to forward a query to. Or perhaps the big models could learn the difference between an easy and a hard question and charge accordingly? Perhaps, if it can measure complexity, even generate a quote?

Small models are fine for small coding tasks but I don't see why big ones can't be broken down most of the time.

  • Many harnesses do this, I've recently dropped all my big subscriptions for using deepseek. Codewhale (formerly deepseek-tui) will use pro for large tasks and route smaller ones to flash. It's pretty good, but I just use pro and everything as the cost is quite low.

    This one does not have routing, but reasonix is insane, absolutely insane for saving money. I've used 1.3billion tokens at the cost of 4$. (99-100% cache hit)

  • > I wonder to what extent models should figure out which model to forward a query to. Or perhaps the big models could learn the difference between an easy and a hard question and charge accordingly?

    This sounds like something a harness could do (and might already be doing), with work delegated to subagents running on lower-cost models.

I'm legit annoyed at opus 4.8 at any setting above 4.8.

I believe it can be great for vibe coding, but mundane day work? Hell no, I'd rather work with Haiku. It's too slow, checks too many things, it's annoying as hell.